The first part of this Rmd will contain notes and exercises from the lecture series on the book Introduction to Statistical Learning. The second part of this Rmd document will contain notes and examples from the book itself.

***

1 Chapter 1 Introduction


1.1 Opening Remarks


1.1.1 The Supervised Learning Problem

Starting Point

  • Outcome measurement \(Y\) (also called the dependent variable, response, or target).
  • Vector of \(p\) predictor measurements \(X\) (also called inputs, regressors, covariates, features, or independent variables).
  • In regression problems, \(Y\) is quantitative (e.g. price, blood pressure).
  • In classification problems, \(Y\) takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of a tissue sample).
  • We have training data \((x_{1},y_{1}),...,(x_{N},y_{N})\). These are observations (examples, instances) of these measurements.

1.1.2 Objectives

On the basis of the training data we would like to:
* Accurately predict unseen test cases.
* Understand which inputs affect the outcome, and how.
* Assess the quality of our predictions and inferences.

1.2 Examples and Frameworks


1.2.1 Unsupervised Learning

  • No outcome variable, just a set of predictors (features) measured on a set of samples.
  • The objective is more fuzzy -- find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
  • It is difficult to know how well you are doing.
  • Different from supervised learning, but can be useful as a pre-processing step for supervised learning.

1.2.2 Statistical Learning versus Machine Learning

  • Machine learning arose as a subfield of Artificial Intelligence
  • Statistical learning arose as a subfield of Statistics
  • There is much overlap -- both fields focus on supervised and unsupervised problems:
    • Machine learning has a greater emphasis on large scale applications and prediction accuracy.
    • Statistical learning emphasizes models and their interpretability, and precision and uncertainty.
  • But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization".
  • Machine learning has the upper hand in Marketing!

2 Chapter 2 Overview of Statistical Learning


2.1 Introduction to Regression Models


What is Statistical Learning?

Expected value is a fancy term for the average value.

Example of statistical learning: Can we predict Sales using these three variables?

\[ Sales \approx f(TV,Radio,Newspaper) \]

2.1.1 Notation

Here Sales is a response, or target, that we wish to predict. We generically refer to the response as \(Y\). TV is a feature, or input, or predictor; we name it \(X_{1}\). Likewise, we name Radio \(X_{2}\), and so on. We can refer to the input vector collectively as:
\[ X = \begin{pmatrix} X_{1} \\ X_{2} \\ X_{3} \end{pmatrix} \] Now we write our model as:
\[ Y = f(X) + \epsilon \]

where \(\epsilon\) captures measurement errors and other discrepancies.

2.1.2 What is \(f(X)\) good for?

  • With a good \(f\) we can make predictions of \(Y\) at new points \(X=x\).
  • We can understand which components of \(X=(X_{1},X_{2},...,X_{p})\) are important in explaining \(Y\), and which are irrelevant. e.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.
  • Depending on the complexity of \(f\), we may be able to understand how each component \(X_{j}\) of \(X\) affects \(Y\).

Is there an ideal \(f(X)\)? In particular, what is a good value for \(f(X)\) at any selected value of \(X\), say \(X = 4\)? There can be many \(Y\) values at \(X = 4\). A good value is:
\[ f(4) = E(Y\mid X = 4) \]
\(E(Y\mid X = 4)\) means expected value (average) of \(Y\) given \(X\) =4.
This ideal \(f(x) = E(Y\mid X = x)\) is called the regression function.

2.1.3 The Regression function \(f(x)\)

  • Is also defined for vector \(X\); e.g.
    \[ f(x) = f(x_{1},x_{2},x_{3}) = E(Y\mid X_{1}=x_{1},X_{2}=x_{2},X_{3}=x_{3}) \]
  • Is the ideal or optimal predictor of \(Y\) with regard to mean-squared prediction error: \(f(x) = E(Y\mid X = x)\) is the function that minimizes \(E[(Y - g(X))^2 \mid X=x]\) over all functions \(g\) at all points \(X=x\).
  • \(\epsilon = Y - f(x)\) is the irreducible error -- i.e. even if we knew \(f(X)\), we would still make errors in prediction, since at each \(X=x\) there is typically a distribution of possible \(Y\) values.
  • For any estimate \(\hat f(x)\) of \(f(x)\), we have:
    \[ E[(Y - \hat f(X))^2 \mid X =x ] = \underbrace{[f(x) - \hat f(x)]^2}_{Reducible} + \underbrace{Var(\epsilon)}_{Irreducible} \]
    > hats over functions tend to represent estimators.

2.1.4 How to estimate \(f\)

  • Typically we have few if any data points with \(X=4\) exactly.
  • So we cannot compute \(E(Y\mid X = x)\)!
  • Relax the definition and let:
    \[ \hat f(x) = Ave(Y\mid X \in N(x)) \]
    where \(N(x)\) is some neighborhood of \(x\).

This is called nearest neighbors or local averaging: we relax the definition to allow averaging over nearby points. This compensates for the fact that there may be no observed \(y\) at any particular value of \(x\).
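A minimal base-R sketch of local averaging on simulated data (the neighborhood width `h` and the true function `sin(x)` are illustrative choices, not from the lectures):

```r
set.seed(1)
# Simulate (x, y) pairs from a known regression function f(x) = sin(x)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)

# Estimate f(x0) by averaging the y-values whose x lies in a neighborhood N(x0)
local_average <- function(x0, x, y, h = 0.25) {
  mean(y[abs(x - x0) <= h])
}

fhat <- local_average(pi / 2, x, y)   # true value f(pi/2) = 1
```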

2.2 Dimensionality and Structured Models


  • Nearest neighbor averaging can be pretty good for small \(p\) -- i.e. \(p \le 4\) -- and large-ish \(N\).
  • We will discuss smoother versions, such as kernel and spline smoothing later in the course.
  • Nearest neighbor methods can be lousy when \(p\) is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
    • We need to get a reasonable fraction of the \(N\) values of \(y_{i}\) to average, to bring the variance down -- e.g. 10%.
    • A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating \(E(Y\mid X=x)\) by local averaging.

2.2.1 Parametric and structured models

The linear model is an important example of a parametric model:
\[ f_{L}(X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p} \]
* A linear model is specified in terms of \(p + 1\) parameters \(\beta_{0},\beta_{1},...,\beta_{p}\).
* We estimate the parameters by fitting the model to training data.
* Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function \(f(X)\).

2.2.2 Some trade-offs

  • Prediction accuracy versus interpretability.
    • Linear models are easy to interpret; thin-plate splines are not.
  • Good fit versus over-fit or under-fit.
    • How do we know when the fit is just right?
  • Parsimony versus black-box.
    • We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.

2.3 Model Selection and Bias-Variance Tradeoff


2.3.1 Assessing Model Accuracy

Suppose we fit a model \(\hat f(x)\) to some training data \(Tr=\{x_{i},y_{i}\}_{1}^N\), and we wish to see how well it performs.
* We could compute the average squared prediction error over \(Tr\):
\[ MSE_{Tr} = Ave_{i \in Tr}[y_{i} - \hat f(x_{i})]^2 \]
This may be biased toward more overfit models.
* Instead we should, if possible, compute it using fresh test data \(Te = \{x_{i},y_{i}\}_{1}^M\):
\[ MSE_{Te} = Ave_{i \in Te}[y_{i} - \hat f(x_{i})]^2 \]
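A simulated sketch of the two error measures (the quadratic truth and degree-10 polynomial are illustrative assumptions): an overly flexible fit typically makes \(MSE_{Tr}\) much smaller than \(MSE_{Te}\).

```r
set.seed(2)
# Simulate from a quadratic truth, split into train (Tr) and test (Te)
n <- 100
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- dat$x^2 + rnorm(n, sd = 0.5)
train <- sample(n, n / 2)

# Deliberately flexible fit: a degree-10 polynomial
fit <- lm(y ~ poly(x, 10), data = dat, subset = train)

mse_tr <- mean((dat$y[train]  - predict(fit, dat[train, ]))^2)   # MSE over Tr
mse_te <- mean((dat$y[-train] - predict(fit, dat[-train, ]))^2)  # MSE over Te
```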

2.3.2 Bias-Variance Trade-off

Suppose we have fit a model \(\hat f(x)\) to some training data \(Tr\), and let \((x_{0},y_{0})\) be a test observation drawn from the population. If the true model is \(Y = f(X) + \epsilon\) (with \(f(x) = E(Y\mid X = x)\)), then
\[ E(y_{0} - \hat f(x_{0}))^2 = Var(\hat f(x_{0})) + [Bias(\hat f(x_{0}))]^2 + Var(\epsilon) \]

The expectation averages over the variability of \(y_{0}\), as well as the variability in \(Tr\). Note that \(Bias(\hat f(x_{0})) = E[\hat f(x_{0})] - f(x_{0})\).

Typically as the flexibility of \(\hat f\) increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
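The decomposition can be checked by simulation (a sketch, with the true function, sample size, and model degree all chosen for illustration): refit \(\hat f\) on many fresh training sets and look at the spread and center of \(\hat f(x_{0})\).

```r
set.seed(3)
f  <- function(x) sin(2 * x)   # known true regression function
x0 <- 1                        # evaluation point
sigma <- 0.3                   # sd of the irreducible error

# Refit the same model on 500 fresh training sets and record f_hat(x0)
preds <- replicate(500, {
  d <- data.frame(x = runif(50, 0, 3))
  d$y <- f(d$x) + rnorm(50, sd = sigma)
  predict(lm(y ~ poly(x, 4), data = d), data.frame(x = x0))
})

variance <- var(preds)                # Var(f_hat(x0))
bias_sq  <- (mean(preds) - f(x0))^2   # [Bias(f_hat(x0))]^2
# Expected test error at x0 is approximately variance + bias_sq + sigma^2
```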

2.4 Classification


2.4.1 Classification Problems

Here the response variable \(Y\) is qualitative -- e.g. email is one of \(C = \{spam, ham\}\) (ham = good email), digit class is one of \(C = \{0,1,...,9\}\). Our goals are to:
* Build a classifier \(C(X)\) that assigns a class label from \(C\) to a future unlabeled observation \(X\).
* Assess the uncertainty in each classification
* Understand the roles of the different predictors among \(X = (X_{1},X_{2},...,X_{p})\).

2.4.2 Classification: Some details

  • Typically we measure the performance of \(\hat C(x)\) using the misclassification error rate:
    \[ Err_{Te} = Ave_{i \in Te}I[y_{i} \neq \hat C(x_{i})] \]
  • The Bayes classifier (using the true \(p_{k}(x)\)) has the smallest error (in the population).
  • Support-vector machines build structured models for \(C(x)\).
  • We will also build structured models for representing the \(p_{k}(x)\). e.g. Logistic regression, generalized additive models.
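The error-rate formula can be illustrated with a deliberately naive classifier on simulated labels (everything here -- the classes, probabilities, and classifier -- is a toy assumption, not from the lectures):

```r
set.seed(4)
# A naive classifier C_hat: always predict the training majority class
y_tr <- sample(c("spam", "ham"), 100, replace = TRUE, prob = c(0.3, 0.7))
y_te <- sample(c("spam", "ham"), 100, replace = TRUE, prob = c(0.3, 0.7))

majority <- names(which.max(table(y_tr)))   # C_hat ignores x entirely
err_te <- mean(y_te != majority)            # Ave over Te of I[y_i != C_hat(x_i)]
```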

2.4.3 Introduction to R


## [1] 2 7 5
## [1]  4  7 10
## [1]  6 14 15
## [1] 0.5 1.0 0.5
## [1]      16  823543 9765625
## [1] 7
## [1] 7 5
## [1] 2 5
## [1] 5
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
##      [,1] [,2]
## [1,]    7   11
## [2,]    8   12
##      [,1] [,2]
## [1,]    5    9
## [2,]    6   10
## [3,]    7   11
## [4,]    8   12
## [1] 1 2 3 4
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## [1] 4 3
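The source code for this section was lost in rendering; the outputs above are consistent with a session along these lines (a reconstruction, not the original code):

```r
x <- c(2, 7, 5)
x                                  # 2 7 5
y <- seq(from = 4, length = 3, by = 3)
y                                  # 4 7 10
x + y                              # 6 14 15
x / y                              # 0.5 1.0 0.5
x ^ y                              # 16 823543 9765625
x[2]                               # 7
x[2:3]                             # 7 5
x[-2]                              # 2 5
x[-c(1, 2)]                        # 5
z <- matrix(seq(1, 12), 4, 3)      # 4 x 3 matrix, filled column by column
z[3:4, 2:3]
z[, 2:3]
z[, 1]                             # drops to a vector: 1 2 3 4
z[, 1, drop = FALSE]               # stays a 4 x 1 matrix
dim(z)                             # 4 3
```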
##  [1] "alpha"              "alpha.fn"           "Auto"              
##  [4] "best.fit"           "boot.out"           "Carseats"          
##  [7] "coefi"              "cv.error"           "cv.error10"        
## [10] "cv.errors"          "cv.lasso"           "cv.ridge"          
## [13] "d"                  "dat"                "degree"            
## [16] "diamond_samp"       "diamonds"           "diamondsbig"       
## [19] "diamondsBigGia"     "diamondsurl"        "Direction.2005"    
## [22] "fit"                "fit.lasso"          "fit.ridge"         
## [25] "fit1"               "fit2"               "fit3"              
## [28] "fit4"               "fit5"               "fit6"              
## [31] "fit7"               "folds"              "galton"            
## [34] "glm.fit"            "glm.pred"           "glm.probs"         
## [37] "hisDiamond"         "Hitters"            "i"                 
## [40] "k"                  "knn.pred"           "lam.best"          
## [43] "lasso.tr"           "lda.fit"            "lda.pred"          
## [46] "lm.fit"             "lm.fit1"            "lm.fit2"           
## [49] "lm.fit5"            "LoadLibraries"      "loocv"             
## [52] "m1.1"               "m2.1"               "m3.1"              
## [55] "m4.1"               "m5.1"               "modelEstimate"     
## [58] "new.Xy"             "pred"               "predict.regsubsets"
## [61] "reg.summary"        "regfit.full"        "regfit.fwd"        
## [64] "regplot"            "rmse"               "rmse.cv"           
## [67] "Smarket.2005"       "thisDiamond"        "train"             
## [70] "val.errors"         "x"                  "x.test"            
## [73] "XLag"               "Xy"                 "y"                 
## [76] "z"

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## [1] 32 11
## [1] "data.frame"
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

##  [1] ".GlobalEnv"        "Auto"              "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
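The data-frame output above is consistent with commands along these lines (a reconstruction), using the built-in mtcars data:

```r
names(mtcars)    # variable names in the built-in mtcars data frame
dim(mtcars)      # 32 observations of 11 variables
class(mtcars)    # "data.frame"
summary(mtcars)  # quartiles and means for every column
search()         # attached packages and environments
```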

3 Chapter 3: Linear Regression


3.1 Simple Linear Regression


3.1.1 Linear Regression

  • Linear regression is a simple approach to supervised learning. It assumes that the dependence of \(Y\) on \(X_{1},X_{2},...,X_{p}\) is linear.
  • True regression functions are never linear!
  • Although it may seem overly simplistic, linear regression is extremely useful both conceptually and practically.

3.1.2 Linear regression for the advertising data

Consider the advertising data shown on the next slide.

Questions we might ask:
* Is there a relationship between advertising budget and sales?
* How strong is the relationship between advertising budget and sales?
* Which media contribute to sales?
* How accurately can we predict future sales?
* Is the relationship linear?
* Is there synergy among the advertising media?

3.1.3 Simple linear regression using a single predictor \(X\).

  • We assume a model
    \[ Y = \beta_{0} + \beta_{1}X + \epsilon \]
    where \(\beta_{0}\) and \(\beta_{1}\) are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and \(\epsilon\) is the error term.
  • Given some estimates \(\hat\beta_{0}\) and \(\hat\beta_{1}\) for the model coefficients, we predict the future sales using
    \[ \hat y = \hat\beta_{0} + \hat\beta_{1}x \]
    where \(\hat y\) indicates a prediction of \(Y\) on the basis of \(X = x\). The hat symbol denotes an estimated value.

3.1.4 Estimation of the parameters by least squares

  • let \(\hat y_{i} = \hat\beta_{0} + \hat\beta_{1}x_{i}\) be the prediction for \(Y\) based on the ith value of \(X\). Then \(e_{i} = y_{i} - \hat y_{i}\) represents the ith residual.
  • We define the residual sum of squares (RSS) as
    \[ RSS = e_{1}^2 + e_{2}^2 +...+e_{n}^2 \],
    or equivalently as
    \[ RSS = (y_{1} - \hat\beta_{0} - \hat\beta_{1}x_{1})^2 + (y_{2} - \hat\beta_{0} - \hat\beta_{1}x_{2})^2 +...+(y_{n} - \hat\beta_{0} - \hat\beta_{1}x_{n})^2 \]
  • The least squares approach chooses \(\hat\beta_{0}\) and \(\hat\beta_{1}\) to minimize the RSS. The minimizing values can be shown to be
    \[ \hat\beta_{1} = \frac{\sum_{i=1}^n(x_{i}-\bar x)(y_{i}-\bar y)}{\sum_{i =1}^n(x_{i}-\bar x)^2} \]
    \[ \hat\beta_{0}=\bar y - \hat\beta_{1}\bar x \]
    where \(\bar y \equiv \frac{1}{n}\sum_{i =1}^n y_{i}\) and \(\bar x \equiv \frac{1}{n}\sum_{i =1}^n x_{i}\) are the sample means.
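These closed-form formulas can be checked against `lm()` on simulated data (the true intercept 2 and slope 3 are illustrative):

```r
set.seed(5)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)

# Closed-form least squares estimates from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() computes the same coefficients
fit <- coef(lm(y ~ x))
```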

3.1.5 Assessing the Accuracy of the Coefficient Estimates

  • The standard error of an estimator reflects how it varies under repeated sampling. We have
    \[ SE(\hat\beta_{1})^2 = \frac{\sigma^2}{\sum_{i =1}^n (x_{i}-\bar x)^2} \]
    \[ SE(\hat\beta_{0})^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i =1}^n (x_{i} - \bar x)^2}\right] \]
    where \(\sigma^2 = Var(\epsilon)\).
  • These standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter. It has the form
    \[ \hat\beta_{1} \pm 2 \cdot SE(\hat\beta_{1}) \]

That is, there is approximately a 95% chance that the interval
\[ [ \hat\beta_{1} -2 \cdot SE(\hat\beta_{1}), \hat\beta_{1} + 2 \cdot SE(\hat\beta_{1})] \]
will contain the true value of \(\beta_{1}\) (under a scenario where we got repeated samples like the present sample)

For the advertising data, the 95% confidence interval for \(\beta_{1}\) is [0.042,0.05]
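On simulated data, the "\(\pm 2 \cdot SE\)" rule of thumb can be compared with the exact t-based interval from `confint()` (the data-generating values are illustrative):

```r
set.seed(6)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)

b1 <- coef(fit)[["x"]]
se <- summary(fit)$coefficients["x", "Std. Error"]

ci_approx <- c(b1 - 2 * se, b1 + 2 * se)   # the "+/- 2 SE" rule of thumb
ci_exact  <- confint(fit)["x", ]           # exact t-based 95% interval
```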

3.2 Hypothesis Testing and Confidence Intervals


Hypothesis Testing
* Standard errors can also be used to perform hypothesis tests on the coefficients. The most common hypothesis test involves testing the null hypothesis of
* \(H_{0}:\) There is no relationship between \(X\) and \(Y\) versus the alternative hypothesis
* \(H_{A}:\) There is some relationship between \(X\) and \(Y\)

  • Mathematically, this corresponds to testing
    • \(H_{0} : \beta_{1} = 0\)
      versus
    • \(H_{A} : \beta_{1} \neq 0\),
      since if \(\beta_{1} = 0\) then the model reduces to \(Y = \beta_{0} + \epsilon\), and \(X\) is not associated with \(Y\).
  • To test the null hypothesis, we compute a t-statistic, given by
    \[ t = \frac{\hat\beta_{1} - 0}{SE(\hat\beta_{1})} \]
  • This will have a t-distribution with n - 2 degrees of freedom, assuming \(\beta_{1} = 0\).
  • Using statistical software, it is easy to compute the probability of observing any value equal to \(|t|\) or larger. We call this probability the p-value.

3.2.1 Assessing the Overall Accuracy of the Model

  • We compute the Residual Standard Error
    \[ RSE = \sqrt{\frac{1}{n -2}RSS} = \sqrt{\frac{1}{n -2}\sum_{i=1}^n (y_{i} - \hat y_{i})^2} \]
    where the residual sum-of-squares is \(RSS = \sum_{i=1}^n (y_{i}-\hat y_{i})^2\).

  • R-squared or fraction of variance explained is
    \[ R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} \]
    where \(TSS = \sum_{i=1}^n (y_{i} - \bar y)^2\) is the total sum of squares.
  • It can be shown in this simple linear regression setting that \(R^2 = r^2\), where \(r\) is the correlation between \(X\) and \(Y\):
    \[ r = \frac{\sum_{i=1}^n (x_{i} - \bar x)(y_{i} - \bar y)}{\sqrt{\sum_{i=1}^n (x_{i} - \bar x)^2 \sum_{i=1}^n (y_{i} - \bar y)^2}} \]
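The identity \(R^2 = r^2\) is easy to verify on simulated data (a sketch; the coefficients are arbitrary):

```r
set.seed(7)
x <- rnorm(60)
y <- 1 + 2 * x + rnorm(60)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared   # fraction of variance explained
r  <- cor(x, y)                # sample correlation between X and Y
# In simple linear regression, R^2 equals r^2 exactly
```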

3.3 Multiple Linear Regression


  • Here our model is
    \[ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} +...+\beta_{p}X_{p} + \epsilon \]
  • We interpret \(\beta_{j}\) as the average effect on \(Y\) of a one unit increase in \(X_{j}\), holding all other predictors fixed. In the advertising example, the model becomes

\[ sales = \beta_{0} + \beta_{1} \times TV + \beta_{2} \times radio + \beta_{3} \times newspaper + \epsilon \]

3.3.1 Interpreting the regression coefficients

  • The ideal scenario is when the predictors are uncorrelated -- a balanced design:
    • Each coefficient can be estimated and tested separately.
    • Interpretations such as "a unit change in \(X_{j}\) is associated with a \(\beta_{j}\) change in \(Y\), while all the other variables stay fixed", are possible.
  • Correlations amongst predictors cause problems:
    • The variance of all coefficients tends to increase, sometimes dramatically.
    • Interpretations become hazardous -- when \(X_{j}\) changes, everything else changes.
  • Claims of causality should be avoided for observational data.

3.3.2 The woes of (interpreting) regression coefficients

"Data Analysis and Regression" Mosteller and Tukey 1977
* A regression coefficient \(\beta_{j}\) estimates the expected change in \(Y\) per unit change in \(X_{j}\), with all other predictors held fixed. But predictors usually change together!
* Example: \(Y\) = total amount of change in your pocket; \(X_{1}\) = # of coins; \(X_{2}\) = # of pennies, nickels and dimes. By itself, the regression coefficient of \(Y\) on \(X_{2}\) will be > 0. But how about with \(X_{1}\) in the model?
* Example 2: \(Y\) = number of tackles by a football player in a season; \(W\) and \(H\) are his weight and height. Fitted regression model is \(\hat Y = \hat\beta_{0} + .50W - .10H\). How do we interpret \(\hat\beta_{2} < 0\)?

3.3.3 Two quotes by famous Statisticians

"Essentially, all models are wrong, but some are useful." - George Box
"The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively."
- Fred Mosteller and John Tukey

3.3.4 Estimation and Prediction for Multiple Regression

  • Given estimates \(\hat\beta_{0},\hat\beta_{1},...\hat\beta_{p}\), we can make predictions using the formula
    \[ \hat y = \hat\beta_{0} + \hat\beta_{1}x_{1} + \hat\beta_{2}x_{2}+...+\hat\beta_{p}x_{p} \].
  • We estimate \(\beta_{0},\beta_{1},...,\beta_{p}\) as the values that minimize the sum of squared residuals
    \[ RSS = \sum_{i=1}^n (y_{i} - \hat y_{i})^2 \]
    \[ = \sum_{i=1}^n (y_{i} - \hat\beta_{0} - \hat\beta_{1}x_{i1} - \hat\beta_{2}x_{i2} - ... - \hat\beta_{p}x_{ip})^2 \]
    This is done using standard statistical software. The values \(\hat\beta_{0},\hat\beta_{1},...,\hat\beta_{p}\) that minimize RSS are the multiple least squares regression coefficient estimates.

3.4 Some important Questions


  1. Is at least one of the predictors \(X_{1},X_{2},...,X_{p}\) useful in predicting the response?
  2. Do all the predictors help to explain \(Y\), or is only a subset of the predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

3.4.1 Is at least one predictor useful?

For the first question, we can use the F statistic
\[ F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)} \sim F_{p,n-p-1} \]
Example:

Quantity | Value
---------|------
Residual Standard Error | 1.69
\(R^2\) | 0.897
F-Statistic | 570

3.4.2 Deciding on the important variables

  • The most direct approach is called all subsets or best subsets regression: we compute the least squares fit for all possible subsets and then choose between them based on some criterion that balances training error with model size.
  • However we often can't examine all possible models, since there are \(2^p\) of them; for example, when \(p = 40\) there are over a billion models!
  • Instead we need an automated approach that searches through a subset of them. We discuss two commonly used approaches next.

3.4.3 Forward selection

  • Begin with the null model - a model that contains an intercept but no predictors.
  • Fit \(p\) simple linear regressions and add to the null model the variable that results in the lowest RSS.
  • Add to that model the variable that results in the lowest RSS amongst all two-variable models.
  • Continue until some stopping rule is satisfied, for example when all remaining variables have a p-value above some threshold.
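The greedy procedure above can be sketched directly in base R on simulated data (the data-generating model, where only X1 and X3 matter, is an illustrative assumption; packages such as leaps automate this):

```r
set.seed(8)
# Simulated data where only X1 and X3 truly affect the response
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
dat <- data.frame(y = 1 + 2 * X[, 1] - X[, 3] + rnorm(n), X)

# Greedy forward selection: at each step, add the variable giving lowest RSS
selected  <- character(0)
remaining <- colnames(X)
for (step in 1:2) {
  rss <- sapply(remaining, function(v) {
    sum(resid(lm(reformulate(c(selected, v), response = "y"), data = dat))^2)
  })
  best      <- names(which.min(rss))
  selected  <- c(selected, best)
  remaining <- setdiff(remaining, best)
}
selected   # with these strong effects, X1 and X3 are typically chosen first
```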

3.4.4 Backward selection

  • Start with all variables in the model.
  • Remove the variable with the largest p-value -- that is, the variable that is the least statistically significant.
  • The new \((p -1)\)-variable model is fit, and the variable with the largest p-value is removed.
  • Continue until a stopping rule is reached. For instance, we may stop when all remaining variables have a significant p-value, as defined by some significance threshold.

3.4.5 Model selection - continued

  • Later we discuss more systematic criteria for choosing an "optimal" member in the path of models produced by forward or backward stepwise selection.
  • These include Mallow's \(C_{p}\), Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted \(R^2\) and Cross-validation (CV)

3.4.6 Other Considerations in the Regression Model

Qualitative Predictors
* Some predictors are not quantitative but are qualitative, taking a discrete set of values.
* These are also called categorical predictors or factor variables.

3.4.7 Qualitative Predictors - continued

Example: investigate differences in credit card balances between males and females, ignoring the other variables. We create a new dummy variable
\[ x_{i} = \begin{cases} 0, &\text{if i-th person is female} \\ 1, &\text{if i-th person is male} \end{cases} \]
Resulting model:
\[ y_{i}=\beta_{0} + \beta_{1}x_{i} + \epsilon_{i} = \begin{cases} \beta_{0} + \epsilon_{i}, &\text{if i-th person is female} \\ \beta_{0} + \beta_{1} + \epsilon_{i}, &\text{if i-th person is male} \end{cases} \]
Interpretation?

3.4.8 Qualitative predictors with more than two levels

  • With more than two levels, we create additional dummy variables. For example, for the ethnicity variable we create two dummy variables.
  • Then both of these variables can be used in the regression equation, in order to obtain the model \[ y_{i} = \beta_{0} + \beta_{1}x_{i1}+\beta_{2}x_{i2} + \epsilon_{i} = \begin{cases} \beta_{0} + \beta_{1} + \epsilon_{i}, &\text{if i-th person is Asian} \\ \beta_{0} + \beta_{2} + \epsilon_{i}, &\text{if i-th person is Caucasian} \\ \beta_{0} + \epsilon_{i}, &\text{if i-th person is African American} \end{cases} \]
  • There will always be one fewer dummy variable than the number of levels. The level with no dummy variable -- African American in this example -- is known as the baseline.
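In R, this dummy coding happens automatically for factors; a small sketch (the ethnicity levels mirror the example above):

```r
# R encodes a factor with k levels as k - 1 dummy variables automatically
eth <- factor(c("African American", "Asian", "Caucasian", "Asian"))
mm <- model.matrix(~ eth)
colnames(mm)   # intercept plus one dummy per non-baseline level
# The first level alphabetically ("African American") is the baseline
```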

3.5 Extensions of the linear model


Removing the additive assumption: interactions and nonlinearity
Interactions:
* In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media.
* For example, the linear model \[ \widehat{sales} = \beta_{0} + \beta_{1} \times TV + \beta_{2} \times radio + \beta_{3} \times newspaper \]
states that the average effect on sales of a one-unit increase in TV is always \(\beta_{1}\), regardless of the amount spent on radio.

3.5.1 Interactions - continued

  • But suppose that spending money on radio adverstising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.
  • In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.
  • In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.

3.5.2 Interpretation

  • The results in this table suggest that interactions are important.
  • The p-value for the interaction term \(TV \times radio\) is extremely low, indicating that there is strong evidence for \(H_{A} : \beta_{3} \neq 0\).
  • \(R^2\) for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.

3.5.3 Interpretation - continued

  • This means that (96.8 - 89.7) / (100 - 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term.
  • The coefficient estimates in the table suggest that an increase in TV advertising of $1,000 is associated with increased sales of
    \((\hat\beta_{1} + \hat\beta_{3} \times \text{radio}) \times 1000 = 19 + 1.1 \times \text{radio}\) units.
  • An increase in radio advertising of $1,000 will be associated with an increase in sales of \((\hat\beta_{2} + \hat\beta_{3} \times \text{TV}) \times 1000 = 29 + 1.1 \times \text{TV}\) units.

3.5.4 Hierarchy

  • Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.
  • The hierarchy principle:
    • If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

3.5.5 Hierarchy - continued

  • The rationale for this principle is that interactions are hard to interpret in a model without main effects -- their meaning is changed.
  • Specifically, the interaction terms also contain main effects if the model has no main effect terms.

3.5.6 Interactions between qualitative and quantitative variables

Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative). \[ balance_{i} \approx \beta_{0} + \beta_{1} \times income_{i} + \begin{cases} \beta_{2}, &\text{if i-th person is a student} \\ 0, &\text{if i-th person is not a student} \end{cases} \]
\[ = \beta_{1} \times income_{i} + \begin{cases} \beta_{0} + \beta_{2}, &\text{if i-th person is a student} \\ \beta_{0}, &\text{if i-th person is not a student} \end{cases} \]

With interactions, it takes the form:
\[ balance_{i} \approx \beta_{0} + \beta_{1} \times income_{i} + \begin{cases} \beta_{2} + \beta_{3} \times income_{i}, &\text{if student} \\ 0, &\text{if not a student} \end{cases} \]
\[ =\begin{cases} (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3}) \times income_{i}, &\text{if student} \\ \beta_{0} + \beta_{1} \times income_{i}, &\text{if i-th person is not a student} \end{cases} \]
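This model is what `lm(balance ~ income * student)` fits. A self-contained sketch on simulated data standing in for the Credit set (all the simulated coefficients are illustrative assumptions):

```r
set.seed(9)
# Simulated stand-in for the Credit data: students get a different slope
n <- 200
income  <- runif(n, 20, 100)
student <- factor(sample(c("No", "Yes"), n, replace = TRUE))
balance <- 200 + 6 * income + ifelse(student == "Yes", 300 - 2 * income, 0) +
  rnorm(n, sd = 50)
dat <- data.frame(balance, income, student)

fit_add <- lm(balance ~ income + student, data = dat)  # two parallel lines
fit_int <- lm(balance ~ income * student, data = dat)  # slopes can differ
coef(fit_int)   # "income:studentYes" estimates the slope difference
```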

3.5.7 Generalization of Linear Model

In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit:
* Classification problems: logistic regression, support vector machines
* Non-linearity: kernel smoothing, splines and generalized additive models, nearest neighbor methods.
* Interactions: Tree-based methods, bagging, random forests and boosting ( these capture non-linearities)
* Regularized fitting: Ridge regression and lasso

3.6 Linear Regression with R
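The source chunks for this section were lost in rendering; the output below appears to come from commands along these lines, using the Boston housing data from MASS (a reconstruction, not the original code):

```r
library(MASS)     # provides the Boston housing data

names(Boston)
fit1 <- lm(medv ~ lstat, data = Boston)
fit1
summary(fit1)
names(fit1)       # components stored in the fitted-model object
confint(fit1)     # 95% confidence intervals for the coefficients
predict(fit1, data.frame(lstat = c(5, 10, 15)), interval = "confidence")

fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(fit2)
fit3 <- lm(medv ~ ., data = Boston)   # all remaining variables as predictors
summary(fit3)
```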

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
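The Carseats output that follows plausibly comes from commands like these (a sketch):

```r
library(ISLR)  # Carseats data

names(Carseats)
summary(Carseats)

# All main effects plus two interaction terms
fit.carseats <- lm(Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
summary(fit.carseats)

# How R codes the qualitative ShelveLoc variable as dummy variables
contrasts(Carseats$ShelveLoc)
```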

##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age       
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00  
##    Education    Urban       US     
##  Min.   :10.0   No :118   No :142  
##  1st Qu.:12.0   Yes:282   Yes:258  
##  Median :14.0                      
##  Mean   :13.9                      
##  3rd Qu.:16.0                      
##  Max.   :18.0
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1

4 Chapter 4 - Classification


4.1 Introduction to Classification Problems


Classification
* Qualitative variables take values in an unordered set \(C\), such as: \[ \text{eye color} \in \{brown, blue, green\} \] \[ \text{email} \in \{spam, ham\} \]
* Given a feature vector \(X\) and a qualitative response \(Y\) taking values in the set \(C\), the classification task is to build a function \(C(X)\) that takes as input the feature vector \(X\) and predicts its value for \(Y\); i.e. \(C(X) \in C\).
* Often we are more interested in estimating the probabilities that \(X\) belongs to each category in \(C\).
* For example, it's more valuable to have an estimate of the probability that an insurance claim is fraudulent, than a classification fraudulent or not.

4.1.1 Can we use Linear Regression?

Suppose for the Default classification task we code
\[ Y = \begin{cases} 0, &\text{if NO} \\ 1, &\text{if YES} \end{cases} \]
Can we simply perform a linear regression of \(Y\) on \(X\) and classify as Yes if \(\hat Y > 0.5\)?

  • In this case of a binary outcome, linear regression does a good job as a classifier, and is equivalent to linear discriminant analysis, which we discuss later.
  • Since in the population \(E(Y\mid X = x) = Pr(Y = 1 \mid X = x)\), we might think that regression is perfect for this task.
  • However, linear regression might produce probabilities less than zero or bigger than one. Logistic regression is more appropriate.

4.1.2 Linear Regression continued

Now suppose we have a response variable with three possible values. A patient presents at the emergency room, and we must classify them according to their symptoms.
\[ Y = \begin{cases} 1, &\text{if stroke} \\ 2, &\text{if drug overdose} \\ 3, &\text{if epileptic seizure} \end{cases} \]
This coding suggests an ordering, and in fact implies that the difference between stroke and drug overdose is the same as between drug overdose and epileptic seizure.

4.2 Logistic Regression


Let's write \(p(X) = Pr(Y = 1 \mid X)\) for short and consider using balance to predict default. Logistic regression uses the form
\[ p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \]
(\(e \approx 2.71828\) is a mathematical constant [Euler's number.]) It is easy to see that no matter what values \(\beta_{0},\beta_{1}\) or \(X\) take, \(p(X)\) will have values between 0 and 1.

A bit of rearrangement gives \[log\Big(\frac{p(X)}{1 - p(X)}\Big) = \beta_{0} + \beta_{1}X \]
This monotone transformation is called the log odds or logit transformation of \(p(X)\).

4.2.1 Maximum Likelihood

We use maximum likelihood to estimate the parameters.
\[ \ell(\beta_{0},\beta_{1}) = \prod_{i:y_{i}=1}p(x_{i}) \prod_{i:y_{i}=0} (1 - p(x_{i})) \]
This likelihood gives the probability of the observed zeros and ones in the data. We pick \(\beta_{0}\) and \(\beta_{1}\) to maximize the likelihood of the observed data.

Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we use the glm function.

4.2.2 Making Predictions

What is our estimated probability of default for someone with a balance of $1000?
\[ \hat p(X) = \frac{e^{\hat\beta_{0}+\hat\beta_{1}X}}{1 + e^{\hat\beta_{0}+\hat\beta_{1}X}} = \frac{e^{-10.6513+0.0055\times1000}}{1 + e^{-10.6513+0.0055\times1000}} = 0.006 \]
With a balance of $2000?
\[ \hat p(X) = \frac{e^{\hat\beta_{0}+\hat\beta_{1}X}}{1 + e^{\hat\beta_{0}+\hat\beta_{1}X}} = \frac{e^{-10.6513+0.0055\times2000}}{1 + e^{-10.6513+0.0055\times2000}} = 0.586 \]
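These predictions are easy to reproduce in R; `plogis` is the logistic function \(e^z/(1+e^z)\):

```r
# Estimated coefficients from the Default data fit
b0 <- -10.6513
b1 <- 0.0055

plogis(b0 + b1 * 1000)  # approx 0.006: default is very unlikely at balance $1000
plogis(b0 + b1 * 2000)  # approx 0.586: default becomes likely at balance $2000
```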

Let's do it again, using student status as the predictor.
\[\widehat{Pr}(default=Yes\mid student=Yes) = \frac{e^{-3.5041+0.4049\times1}}{1+e^{-3.5041+0.4049\times1}} = 0.0431\], \[\widehat{Pr}(default=Yes\mid student=No) = \frac{e^{-3.5041+0.4049\times0}}{1+e^{-3.5041+0.4049\times0}} = 0.0292\]

4.3 Multivariate Logistic Regression


Logistic regression with several variables
\[ log(\frac{p(X)}{1-p(X)}) = \beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p} \]
\[ p(X) = \frac{e^{\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}}}{1 + e^{\beta_{0}+\beta_{1}X_{1}+...+\beta_{p}X_{p}}} \]

4.3.1 Confounding

  • Students tend to have higher balances than non-students, so their marginal default rate is higher than for non-students.
  • But for each level of balance, students default less than non-students.
  • Multiple logistic regression can tease this out.

4.4 Logistic Regression - Case-Control Sampling and Multiclass


  • In the South African heart-disease data, there are 160 cases and 302 controls: a proportion \(\hat\pi = 0.35\) of the sample are cases. Yet the prevalence of MI (myocardial infarction) in this region is \(\pi = 0.05\).
  • With case-control samples, we can estimate the regression parameters \(\beta_{j}\) accurately ( if our model is correct); the constant term \(\beta_{0}\) is incorrect.
  • We can correct the estimated intercept by a simple transformation
    \[ \hat\beta_{0}^* = \hat\beta_{0} + log\frac{\pi}{1 - \pi} - log\frac{\hat\pi}{1 - \hat\pi} \]
  • Often cases are rare and we take them all; up to five times that number of controls is sufficient.
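The intercept correction is one line of R. Here \(\pi\) is the population prevalence and \(\hat\pi\) the case proportion in the sample (values from the South African data above; the uncorrected intercept `b0.hat` is a hypothetical value for illustration):

```r
pi.true <- 0.05          # population prevalence of MI
pi.hat  <- 160 / 462     # proportion of cases in the case-control sample

b0.hat  <- -4.13         # hypothetical uncorrected intercept estimate
b0.star <- b0.hat + log(pi.true / (1 - pi.true)) - log(pi.hat / (1 - pi.hat))
b0.star                  # corrected intercept
```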

4.4.1 Diminishing returns in unbalanced binary data

  • Sampling more controls than cases reduces the variance of the parameter estimates. But after a ratio of about 5 to 1, the variance reduction flattens out.

4.4.2 Logistic regression with more than two classes

So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form
\[ Pr(Y = k \mid X) = \frac{e^{\beta_{0k}+\beta_{1k}X_{1}+...+\beta_{pk}X_{p}}}{\sum_{l=1}^K e^{\beta_{0l}+\beta_{1l}X_{1}+...+\beta_{pl}X_{p}}} \]
Here there is a linear function for each class.

Multiclass logistic regression is also referred to as multinomial regression.

4.5 Discriminant Analysis


Here the approach is to model the distribution of \(X\) in each of the classes separately, and then use Bayes theorem to flip things around and obtain \(Pr(Y\mid X)\).
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis.
However, this approach is quite general, and other distributions can be used as well. We will focus on normal distributions.

4.5.1 Bayes theorem for classification

Thomas Bayes was a famous mathematician whose name represents a big subfield of statistical and probabilistic modeling. Here we focus on a simple result, known as Bayes theorem:
\[Pr(Y = k \mid X = x) = \frac{Pr(X = x \mid Y = k) \cdot Pr(Y = k)}{Pr(X = x)} \]
One writes this slightly differently for discriminant analysis:
\[ Pr(Y = k \mid X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^K \pi_{l}f_{l}(x)} \]
where
* \(f_{k}(x) = Pr(X = x \mid Y = k)\) is the density for \(X\) in class \(k\). Here we will use normal densities for these, separately in each class.
* \(\pi_{k} = Pr(Y = k)\) is the marginal or prior probability for class \(k\).

4.5.2 Classify to the highest density

We classify a new point according to which density is highest.
When the priors are different, we take them into account as well, and compare \(\pi_{k}f_{k}(x)\). In the lecture's example figure, the larger prior favors the pink class, and the decision boundary shifts to the left.

4.5.3 Why discriminant analysis?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
  • If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

4.6 Gaussian Discriminant Analysis - One Variable


4.6.1 Linear Discriminant Analysis when p = 1

The Gaussian density has the form
\[ f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}e^{-\frac{1}{2}(\frac{x - \mu_{k}}{\sigma_{k}})^2} \]
Here \(\mu_{k}\) is the mean, and \(\sigma_{k}^2\) the variance (in class \(k\)). We will assume that all the \(\sigma_{k} = \sigma\) are the same.

Plugging this into Bayes formula, we get a rather complex expression for \(p_{k}(x) = Pr(Y = k \mid X = x)\):
\[ p_{k}(x)=\frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x-\mu_{k}}{\sigma})^2}}{\sum_{l=1}^K \pi_{l}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x - \mu_{l}}{\sigma})^2}} \]
Happily, there are simplifications and cancellations.

4.6.2 Discriminant functions

To classify at the value \(X = x\), we need to see which of the \(p_{k}(x)\) is largest. Taking logs, and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score:
\[ \delta_{k}(x) = x \cdot \frac{\mu_{k}}{\sigma^2} - \frac{\mu_{k}^2}{2\sigma^2} + log(\pi_{k}) \]
Note that \(\delta_{k}(x)\) is a linear function of \(x\).
If there are \(K = 2\) classes and \(\pi_{1} = \pi_{2} = 0.5\), then one can see that the decision boundary is at
\[ x = \frac{\mu_{1} + \mu_{2}}{2} \]

4.6.3 Estimating the parameters

\[ \hat\pi_{k} = \frac{n_{k}}{n} \]
\[ \hat\mu_{k} = \frac{1}{n_{k}}\sum_{i:y_{i}=k} x_{i} \]
\[ \hat\sigma^2 = \frac{1}{n - K}\sum_{k=1}^K\sum_{i:y_{i}=k}(x_{i}-\hat\mu_{k})^2 \]
\[ = \sum_{k=1}^K\frac{n_{k} -1}{n - K} \cdot \hat\sigma_{k}^2 \]
where \(\hat\sigma_{k}^2 = \frac{1}{n_{k}-1}\sum_{i:y_{i}=k}(x_{i}-\hat\mu_{k})^2\) is the usual formula for the estimated variance in the \(k\)th class.
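A minimal sketch of these estimates on simulated one-dimensional data with two classes:

```r
set.seed(1)
# Two Gaussian classes with a common standard deviation
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
y <- rep(1:2, each = 50)

n <- length(x); K <- 2
pi.hat <- as.vector(table(y)) / n           # class priors n_k / n
mu.hat <- tapply(x, y, mean)                # class means
sigma2.hat <- sum(tapply(x, y, function(v)  # pooled variance estimate
  sum((v - mean(v))^2))) / (n - K)

# Discriminant score for each class at a new point x0
x0 <- 1
delta <- x0 * mu.hat / sigma2.hat - mu.hat^2 / (2 * sigma2.hat) + log(pi.hat)
which.max(delta)  # predicted class
```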

4.7 Gaussian Discriminant Analysis - Many Variables


4.7.1 Linear Discriminant Analysis when p > 1

Density: \(f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)}\)
Discriminant function: \(\delta_{k}(x) = x^T\Sigma^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^T\Sigma^{-1}\mu_{k}+log(\pi_{k})\)
It looks complex, but the important thing to note is that it is, again, linear in \(x\): \(x\) is multiplied by a coefficient vector (\(\Sigma^{-1}\mu_{k}\)), and the remaining terms (\(-\frac{1}{2}\mu_{k}^T\Sigma^{-1}\mu_{k}+log(\pi_{k})\)) are all just constants.

Despite its complex form,
\(\delta_{k}(x)= c_{k0}+c_{k1}x_{1}+...+c_{kp}x_{p}\) -- a linear function.

4.7.2 Fisher's Discriminant Plot

When there are \(K\) classes, linear discriminant analysis can be viewed exactly in a \(K-1\) dimensional plot. Why? Because it essentially classifies to the closest centroid, and the centroids span a \(K-1\) dimensional plane. Even when \(K > 3\), we can find the 'best' 2-dimensional plane for visualizing the discriminant rule.

4.7.3 From \(\delta_{k}(x)\) to probabilities

Once we have estimates \(\hat\delta_{k}(x)\), we can turn these into estimates for class probabilities:
\[ \widehat{Pr}(Y = k\mid X = x) = \frac{e^{\hat\delta_{k}(x)}}{\sum_{l=1}^K e^{\hat\delta_{l}(x)}} \]
So classifying to the largest \(\hat\delta_{k}(x)\) amounts to classifying to the class for which \(\widehat{Pr}(Y = k \mid X = x)\) is largest.
When \(K = 2\), we classify to class 2 if \(\widehat{Pr}(Y = 2 \mid X = x) \geq 0.5\) else to class 1.
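Converting discriminant scores to probabilities is a softmax over the \(\hat\delta_{k}(x)\). A sketch with hypothetical scores for \(K = 3\) classes at one point \(x\):

```r
# Hypothetical discriminant scores for K = 3 classes
delta <- c(1.2, 0.4, -0.7)

probs <- exp(delta) / sum(exp(delta))
probs             # class probabilities, summing to 1
which.max(probs)  # same class as which.max(delta)
```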

4.7.4 LDA on Credit Data

The confusion matrix for LDA on the Default data:

|               | True No | True Yes | Total |
|---------------|---------|----------|-------|
| Predicted No  | 9644    | 252      | 9896  |
| Predicted Yes | 23      | 81       | 104   |
| Total         | 9667    | 333      | 10000 |

We make (23 + 252)/10000 errors -- a 2.75% misclassification rate!
Some Caveats:
* This is training error, and we may be overfitting. Not a big concern here since n = 10000, and p = 4!
* If we classified to the prior -- always to class No in this case -- we would make 333/10000 errors, or only 3.33%. This is called the null rate or the naive classifier; our model has to at least outperform this rate or we are just wasting our time.
* Of the true No's, we make 23/9667 = 0.2% errors; of the true Yes's, we make 252/333 = 75.7% errors!

4.7.5 Types of errors

False positive rate: the fraction of negative examples that are classified as positive -- 0.2% in this example.
False negative rate: the fraction of positive examples that are classified as negative -- 75.7% in this example.
We produced this table by classifying to class Yes if
\[ \widehat{Pr}(Default=Yes\mid Balance,Student) \geq 0.5\]
We can change the two error rates by changing the threshold from 0.5 to some other value in [0,1]:
\[ \widehat{Pr}(Default = Yes \mid Balance, Student) \geq threshold \]
and varying the threshold.
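A sketch of recomputing the confusion matrix at different thresholds, using hypothetical fitted probabilities and labels standing in for the Default data:

```r
set.seed(1)
prob    <- runif(100)                                   # hypothetical fitted probabilities
default <- factor(ifelse(runif(100) < prob, "Yes", "No"))  # hypothetical true labels

# Lowering the threshold catches more true Yes's at the cost of more false positives
for (thr in c(0.5, 0.2, 0.1)) {
  pred <- ifelse(prob >= thr, "Yes", "No")
  print(table(Predicted = pred, True = default))
}
```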

4.7.6 ROC Curve

The ROC curve displays both error rates simultaneously, as the threshold varies.
Sometimes we use the AUC, or area under the curve, to summarize the overall performance. A higher AUC is better.

4.8 Quadratic Discriminant Analysis and Naive Bayes


\[ Pr(Y = k \mid X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^K \pi_{l}f_{l}(x)} \]

When the \(f_{k}(x)\) are Gaussian densities with the same covariance matrix \(\Sigma\), this leads to linear discriminant analysis. By altering the forms for \(f_{k}(x)\), we get different classifiers:
* With Gaussians but different \(\Sigma_{k}\) in each class, we get quadratic discriminant analysis.
* With \(f_{k}(x) = \prod_{j=1}^p f_{jk}(x_{j})\) (conditional independence model) in each class we get naive Bayes. For Gaussians this means the \(\Sigma_{k}\) are diagonal.
* Many other forms, by proposing specific density models for \(f_{k}(x)\), including nonparametric approaches.

4.8.1 Quadratic Discriminant Analysis

\[ \delta_{k}(x) = -\frac{1}{2}(x - \mu_{k})^T \Sigma_{k}^{-1} (x - \mu_{k}) + log(\pi_{k}) - \frac{1}{2}log|\Sigma_{k}| \]
Because the \(\Sigma_{k}\) are different, the quadratic terms matter.

4.8.2 Naive Bayes

Assumes features are independent in each class. Useful when p is large, and so multivariate methods like QDA and even LDA break down.
* Gaussian naive Bayes assumes each \(\Sigma_{k}\) is diagonal:
\[ \delta_{k}(x) \propto log\bigg[ \pi_{k}\prod_{j=1}^p f_{kj}(x_{j})\bigg] \] \[ = -\frac{1}{2}\sum_{j=1}^p\Big[\frac{(x_{j}-\mu_{kj})^2}{\sigma_{kj}^2} + log\sigma_{kj}^2\Big] + log\pi_{k} \]
* Can be used for mixed feature vectors (qualitative and quantitative). If \(X_{j}\) is qualitative, replace \(f_{kj}(x_{j})\) with the probability mass function (histogram) over the discrete categories.
Despite strong assumptions, naive Bayes often produces good classification results.

4.8.3 Logistic Regression versus LDA

For a two-class problem, one can show that for LDA
\[ log\big( \frac{p_{1}(x)}{1 - p_{1}(x)}\big) = log\big(\frac{p_{1}(x)}{p_{2}(x)}\big) = c_{0} + c_{1}x_{1} +...+ c_{p}x_{p} \]
So it has the same form as logistic regression.
The difference is in how the parameters are estimated.
* Logistic regression uses the conditional likelihood based on \(Pr(Y\mid X)\) (known as discriminant learning).
* LDA uses the full likelihood based on \(Pr(X,Y)\) (known as generative learning).
* Despite these differences, in practice the results are often very similar.
Footnote: logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.

4.8.4 Summary

  • Logistic regression is very popular for classification, especially when K=2.
  • LDA is useful when \(n\) is small, or the classes are well separated, and the Gaussian assumptions are reasonable. Also when \(K > 2\).
  • Naive Bayes is useful when \(p\) is very large.

4.9 Classification in R
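The output below comes from commands along these lines (a sketch of the standard Smarket lab; the exact code chunks were not preserved):

```r
library(ISLR)    # Smarket data
library(MASS)    # lda
library(class)   # knn

names(Smarket)
summary(Smarket)

# Logistic regression of market direction on lagged returns and volume
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               family = binomial, data = Smarket)
summary(glm.fit)

glm.probs <- predict(glm.fit, type = "response")
glm.pred  <- ifelse(glm.probs > 0.5, "Up", "Down")
table(glm.pred, Smarket$Direction)
mean(glm.pred == Smarket$Direction)   # training accuracy

# Train on pre-2005 data, test on 2005
train <- Smarket$Year < 2005

# LDA with two predictors
lda.fit  <- lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)
lda.pred <- predict(lda.fit, Smarket[!train, ])
table(lda.pred$class, Smarket$Direction[!train])

# K-nearest neighbors with k = 1
Xlag <- cbind(Smarket$Lag1, Smarket$Lag2)
knn.pred <- knn(Xlag[train, ], Xlag[!train, ], Smarket$Direction[train], k = 1)
table(knn.pred, Smarket$Direction[!train])
```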


## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
##       Year           Lag1                Lag2          
##  Min.   :2001   Min.   :-4.922000   Min.   :-4.922000  
##  1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500  
##  Median :2003   Median : 0.039000   Median : 0.039000  
##  Mean   :2003   Mean   : 0.003834   Mean   : 0.003919  
##  3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
##  Max.   :2005   Max.   : 5.733000   Max.   : 5.733000  
##       Lag3                Lag4                Lag5         
##  Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.92200  
##  1st Qu.:-0.640000   1st Qu.:-0.640000   1st Qu.:-0.64000  
##  Median : 0.038500   Median : 0.038500   Median : 0.03850  
##  Mean   : 0.001716   Mean   : 0.001636   Mean   : 0.00561  
##  3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.59700  
##  Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.73300  
##      Volume           Today           Direction 
##  Min.   :0.3561   Min.   :-4.922000   Down:602  
##  1st Qu.:1.2574   1st Qu.:-0.639500   Up  :648  
##  Median :1.4229   Median : 0.038500             
##  Mean   :1.4783   Mean   : 0.003138             
##  3rd Qu.:1.6417   3rd Qu.: 0.596750             
##  Max.   :3.1525   Max.   : 5.733000

## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Smarket)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
##         1         2         3         4         5 
## 0.5070841 0.4814679 0.4811388 0.5152224 0.5107812
##         Direction
## glm.pred Down  Up
##     Down  145 141
##     Up    457 507
## [1] 0.5216
##         Direction.2005
## glm.pred Down Up
##     Down   77 97
##     Up     34 44
## [1] 0.4801587
##         Direction.2005
## glm.pred Down  Up
##     Down   35  35
##     Up     76 106
## [1] 0.5595238
## Call:
## lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = Year < 
##     2005)
## 
## Prior probabilities of groups:
##     Down       Up 
## 0.491984 0.508016 
## 
## Group means:
##             Lag1        Lag2
## Down  0.04279022  0.03389409
## Up   -0.03954635 -0.03132544
## 
## Coefficients of linear discriminants:
##             LD1
## Lag1 -0.6420190
## Lag2 -0.5135293

## [1] "list"
##      class posterior.Down posterior.Up         LD1
## 999     Up      0.4901792    0.5098208  0.08293096
## 1000    Up      0.4792185    0.5207815  0.59114102
## 1001    Up      0.4668185    0.5331815  1.16723063
## 1002    Up      0.4740011    0.5259989  0.83335022
## 1003    Up      0.4927877    0.5072123 -0.03792892
##       
##        Down  Up
##   Down   35  35
##   Up     76 106
## [1] 0.5595238
##        Lag1   Lag2
## [1,]  0.381 -0.192
## [2,]  0.959  0.381
## [3,]  1.032  0.959
## [4,] -0.623  1.032
## [5,]  0.614 -0.623
##         
## knn.pred Down Up
##     Down   43 58
##     Up     68 83
## [1] 0.5
##         
## knn.pred Down Up
##     Down   48 56
##     Up     63 85
## [1] 0.5277778

5 Chapter 5 Resampling Methods


5.1 Cross-Validation


5.1.1 Cross-validation and the Bootstrap

  • In this section we discuss two resampling methods: cross-validation and the bootstrap.
  • These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.
  • For example, they provide estimates of test-set prediction error, and the standard deviation and bias of our parameter estimates.

5.1.2 Training Error versus Test error

  • Recall the distinction between the test error and the training error:
  • The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
  • In contrast, the training error can be easily calculated by applying the statistical learning method to the observations used in its training.
  • But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

5.1.3 More on prediction-error estimates

  • Best solution: a large designated test set. Often not available.
  • Some methods make a mathematical adjustment to the training error rate in order to estimate the test error rate. These include the \(C_{p}\) statistic, AIC and BIC. They are discussed elsewhere in this course.
  • Here we instead consider a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process, and then applying the statistical learning method to those held-out observations.

5.1.4 Validation-set approach

  • Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
  • The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.
  • The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response, and misclassification rate in the case of a qualitative (discrete) response.

5.1.5 Drawbacks of validation set approach

  • The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
  • In the validation approach, only a subset of the observations -- those that are included in the training set rather than in the validation set -- are used to fit the model.
  • This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set.

5.2 K-fold Cross-Validation


  • Widely used approach for estimating test error.
  • Estimates can be used to select the best model, and to give an idea of the test error of the final chosen model.
  • The idea is to randomly divide the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K-1\) parts (combined), and then obtain predictions for the left-out \(k\)th part.
  • This is done in turn for each part \(k=1,2,...,K\), and then the results are combined.

5.2.1 The details

  • Let the \(K\) parts be \(C_{1},C_{2},...C_{K}\), where \(C_{k}\) denotes the indices of the observations in part \(k\). There are \(n_{k}\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_{k}=n/K\).
  • Compute
    \[ CV_{(K)}=\sum_{k=1}^K \frac{n_{k}}{n}MSE_{k} \]
    where \(MSE_{k}=\sum_{i\in C_{k}}(y_{i}-\hat y_{i})^2 /n_{k}\), and \(\hat y_{i}\) is the fit for observation \(i\), obtained from the data with part \(k\) removed.
  • Setting \(K=n\) yields \(n\)-fold or leave-one-out cross-validation (LOOCV).
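The procedure above can be sketched in a few lines of R on simulated data:

```r
set.seed(1)
n <- 100; K <- 5
x <- rnorm(n); y <- 2 * x + rnorm(n)
folds <- sample(rep(1:K, length.out = n))   # random fold assignment

mse <- numeric(K)
for (k in 1:K) {
  fit    <- lm(y ~ x, subset = (folds != k))          # fit on the K-1 other parts
  y.hat  <- predict(fit, data.frame(x = x[folds == k]))
  mse[k] <- mean((y[folds == k] - y.hat)^2)           # MSE on the held-out part
}
cv.estimate <- sum(as.vector(table(folds)) / n * mse) # weighted by n_k / n
cv.estimate
```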

5.2.2 A nice special case!

  • With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds:
    \[ CV_{(n)}=\frac{1}{n}\sum_{i=1}^n(\frac{y_{i}-\hat y_{i}}{1 - h_{i}})^2, \]
    where \(\hat y_{i}\) is the \(i\)th fitted value from the original least squares fit, and \(h_{i}\) is the leverage (diagonal of the "hat" matrix). This is like the ordinary MSE, except the \(i\)th residual is divided by \(1-h_{i}\).
  • LOOCV is sometimes useful, but typically doesn't shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance.
  • A better choice is \(K = 5\) or 10.
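The LOOCV shortcut is one line of R, using `hatvalues` for the leverages \(h_{i}\) (sketch on simulated data):

```r
set.seed(1)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

# Leave-one-out CV error without refitting the model n times
loocv <- mean(((y - fitted(fit)) / (1 - hatvalues(fit)))^2)
loocv
```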

5.2.3 Other issues with Cross-validation

  • Since each training set is only \((K-1)/K\) as big as the original training set, the estimates of prediction error will typically be biased upward.
  • This bias is minimized when \(K = n\) (LOOCV), but this estimate has high variance, as noted earlier.
  • \(K = 5\) or 10 provides a good compromise for this bias-variance tradeoff.

5.2.4 Cross-Validation for Classification Problems

  • We divide the data into K roughly equal-sized parts \(C_{1},C_{2},...C_{K}\). \(C_{k}\) denotes the indices of the observations in part \(k\). There are \(n_{k}\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_{k} = \frac{n}{K}\).
  • Compute
    \[ CV_{K} = \sum_{k=1}^K \frac{n_{k}}{n} Err_{k} \]
    where \(Err_{k} = \sum_{i\in C_{k}} I(y_{i} \neq \hat y_{i}) / n_{k}\).
  • The estimated standard deviation of \(CV_{K}\) is
    \[ \widehat{SE}(CV_{K}) = \sqrt{\sum_{k=1}^K (Err_{k} - \overline{Err})^2 / (K -1)} \]
  • This is a useful estimate, but strictly speaking, not quite valid.

5.3 Cross-Validation: the wrong and right way


  • Consider a simple classifier applied to some two-class data:
    1. Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest correlation with the class labels.
    2. We then apply a classifier such as logistic regression, using only these 100 predictors.
      How do we estimate the test set performance of this classifier?

Can we apply cross-validation in step 2, forgetting about step 1?

NO!
* This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process.
* It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error is 50%, but the CV error estimate that ignores Step 1 is zero!

5.3.1 The Wrong and Right Way

  • Wrong: Apply cross-validation in step 2.
  • Right: Apply cross-validation to steps 1 and 2.
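The pitfall is easy to reproduce on a smaller scale. In the Python sketch below (arbitrary sizes and seed, and a one-feature nearest-class-mean classifier standing in for logistic regression), the labels are independent of every feature, so the true error rate is 50%; screening on the full data before cross-validating typically yields a far more optimistic estimate than redoing the screening inside each fold:

```python
import random

random.seed(0)
n, p = 40, 500                      # smaller than the 50-sample / 5000-predictor example
y = [i % 2 for i in range(n)]       # balanced labels, independent of every feature
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

def corr(col, labels):
    m = len(labels)
    mx, my = sum(col) / m, sum(labels) / m
    sxy = sum((a - mx) * (b - my) for a, b in zip(col, labels))
    sxx = sum((a - mx) ** 2 for a in col)
    syy = sum((b - my) ** 2 for b in labels)
    return sxy / (sxx * syy) ** 0.5

def screen(rows, labels):
    """Step 1: keep the single feature most correlated with the labels."""
    cols = list(zip(*rows))
    return max(range(len(cols)), key=lambda j: abs(corr(cols[j], labels)))

def cv_error(rows, labels, K=5, j_fixed=None):
    """Step 2: K-fold CV of a nearest-class-mean classifier on one feature."""
    m = len(labels)
    idx = list(range(m))
    errors = 0
    for k in range(K):
        test = idx[k::K]
        train = [i for i in idx if i not in set(test)]
        j = j_fixed if j_fixed is not None else screen(
            [rows[i] for i in train], [labels[i] for i in train])
        mu1 = sum(rows[i][j] for i in train if labels[i] == 1) / sum(labels[i] for i in train)
        mu0 = sum(rows[i][j] for i in train if labels[i] == 0) / sum(1 - labels[i] for i in train)
        for i in test:
            pred = 1 if abs(rows[i][j] - mu1) < abs(rows[i][j] - mu0) else 0
            errors += (pred != labels[i])
    return errors / m

err_wrong = cv_error(X, y, j_fixed=screen(X, y))  # screening already saw all labels
err_right = cv_error(X, y)                        # screening repeated inside each fold
print("wrong-way CV error:", err_wrong)
print("right-way CV error:", err_right)
```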

5.4 The Bootstrap


  • The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.
  • For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

5.4.1 Where does the name come from?

  • The use of the term bootstrap derives from the phrase to pull oneself up by one's bootstraps, widely thought to be based on an episode in the eighteenth-century book "The Surprising Adventures of Baron Munchausen" by Rudolph Erich Raspe: "The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps."
  • It is not the same as the term "bootstrap" used in computer science meaning to "boot" a computer from a set of core instructions, though the derivation is similar.

5.4.2 A simple example

  • Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y, respectively, where X and Y are random quantities.
  • We will invest a fraction \(\alpha\) of our money in X, and will invest the remaining \(1-\alpha\) in Y.
  • We wish to choose \(\alpha\) to minimize the total risk, or variance, of our investment. In other words, we want to minimize \(Var(\alpha X + (1 - \alpha) Y)\).
  • One can show that the value that minimizes the risk is given by:
    \[ \alpha = \frac{\sigma_{Y}^2 - \sigma_{XY}}{\sigma_{X}^2 + \sigma_{Y}^2 - 2\sigma_{XY}} \],
    where \(\sigma_{X}^2 = Var(X)\), \(\sigma_{Y}^2 = Var(Y)\), and \(\sigma_{XY} = Cov(X,Y)\).
  • But the values of \(\sigma_{X}^2,\sigma_{Y}^2\), and \(\sigma_{XY}\) are unknown.
  • We can compute estimates for these quantities, \(\hat\sigma_{X}^2,\hat\sigma_{Y}^2\), and \(\hat\sigma_{XY}\), using a data set that contains measurements for X and Y.
  • We can then estimate the value of \(\alpha\) that minimizes the variance of our investment using:
    \[ \hat\alpha = \frac{\hat\sigma_{Y}^2 - \hat\sigma_{XY}}{\hat\sigma_{X}^2 + \hat\sigma_{Y}^2 - 2\hat\sigma_{XY}} \]
  • To estimate the standard deviation of \(\hat\alpha\), we repeated the process of simulating 100 paired observations of X and Y, and estimating \(\alpha\), 1,000 times.
  • We thereby obtain 1,000 estimates for \(\alpha\) which we can call \(\hat\alpha_{1},\hat\alpha_{2},...,\hat\alpha_{1000}\).
  • The left-hand panel of the Figure on slide 29 displays a histogram of the resulting estimates.
  • For these simulations the parameters were set to \(\sigma_{X}^2 = 1\), \(\sigma_{Y}^2 = 1.25\), and \(\sigma_{XY} = 0.5\), and so we know that the true value of \(\alpha\) is 0.6 (indicated by the red line).
  • The mean over all 1,000 estimates for \(\alpha\) is
    \[ \bar\alpha=\frac{1}{1000}\sum_{r=1}^{1000} \hat\alpha_{r} = 0.5996 \],
    very close to \(\alpha=0.6\), and the standard deviation of the estimates is
    \[ \sqrt{\frac{1}{1000-1}\sum_{r=1}^{1000} (\hat\alpha_{r} - \bar\alpha)^2} =0.083 \]
  • This gives us a very good idea of the accuracy of \(\hat\alpha\): \(SE(\hat\alpha) \approx 0.083\).
  • So roughly speaking, for a random sample from the population, we would expect \(\hat\alpha\) to differ from \(\alpha\) by approximately 0.08, on average.

5.4.3 Now back to the real world

  • The procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population.
  • However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.
  • Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.
  • Each of these "bootstrap data sets" is created by sampling with replacement, and is the same size as our original dataset. As a result some observations may appear more than once in a given bootstrap data set and some not at all.
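The resampling loop can be sketched as follows: a minimal Python illustration on simulated returns (the Portfolio data from the text is not loaded here), estimating \(SE(\hat\alpha)\) for the \(\hat\alpha\) formula above:

```python
import random

def alpha_hat(pairs):
    """Plug-in estimate of alpha from paired (x, y) observations."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    vx = sum((a - mx) ** 2 for a, _ in pairs) / (n - 1)
    vy = sum((b - my) ** 2 for _, b in pairs) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in pairs) / (n - 1)
    return (vy - cxy) / (vx + vy - 2 * cxy)

def bootstrap_se(pairs, stat, B=1000):
    """SD of the statistic over B resamples drawn with replacement, each of size n."""
    n = len(pairs)
    stats = [stat([random.choice(pairs) for _ in range(n)]) for _ in range(B)]
    mean = sum(stats) / B
    return (sum((s - mean) ** 2 for s in stats) / (B - 1)) ** 0.5

random.seed(1)
data = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(100)]
print("alpha-hat:", alpha_hat(data))
print("bootstrap SE:", bootstrap_se(data, alpha_hat))
```

This mirrors what `boot(Portfolio, alpha.fn, R = 1000)` does in the R session later in these notes.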

5.5 More on the Bootstrap


5.5.1 The bootstrap in general

  • In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought.
  • For example, if the data is a time series, we can't simply sample the observations with replacement (why not?).
    • Solution: the block bootstrap, which divides the data up into blocks and assumes that things are independent between blocks.
  • We can instead create blocks of consecutive observations, and sample those with replacement. Then we paste together the sampled blocks to obtain a bootstrap dataset.

5.5.2 Other uses of the bootstrap

  • Primarily used to obtain standard errors of an estimate.
  • Also provides approximate confidence intervals for a population parameter. For example, looking at the histogram in the middle panel of the Figure on slide 29, the 5% and 95% quantiles of the 1000 values are (.43, .72).
  • This represents an approximate 90% confidence interval for the true \(\alpha\). *How do we interpret this confidence interval?*
  • The above interval is called a Bootstrap Percentile confidence interval. It is the simplest method (among many approaches) for obtaining a confidence interval from the bootstrap.

5.5.3 Can the bootstrap estimate prediction error?

  • In cross-validation, each of the K validation folds is distinct from the other K-1 folds used for training: There is no overlap. This is crucial for its success.
  • To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample.
  • But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample.
  • This will cause the bootstrap to seriously underestimate the true prediction error.
  • The other way around (with the original sample as the training sample and the bootstrap dataset as the validation sample) is worse!
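The "about two-thirds" figure follows from the chance that a given observation appears at least once in a bootstrap sample, \(1-(1-1/n)^n\), which tends to \(1-e^{-1}\approx 0.632\). A quick check:

```python
import math

def prob_in_bootstrap(n):
    """P(a given observation appears at least once in a bootstrap sample of size n)."""
    return 1 - (1 - 1 / n) ** n

for n in (10, 100, 10000):
    print(n, prob_in_bootstrap(n))
print("limit:", 1 - math.exp(-1))
```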

5.5.4 Removing the overlap

  • Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample.
  • But the method gets complicated, and in the end, cross-validation provides a simpler, more attractive approach for estimating prediction error.

5.6 Resampling in R


## Loading required package: boot
## Warning: package 'boot' was built under R version 3.1.3
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

## [1] 17.25330 17.19723
## [1] 17.2533
## Warning in cv.glm(Auto, glm.fit, K = 10): 'K' has been set to 11.000000

## [1] 0.5758321
## [1] 0.5758321
## [1] 0.5963833
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.5758321 -7.315422e-05  0.08861826

5.7 Chapter 5 Quiz


## [1] "X1" "X2" "y"
## 
## Call:
## lm(formula = y ~ X1 + X2, data = Xy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44171 -0.25468 -0.01736  0.33081  1.45860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.26583    0.01988  13.372  < 2e-16 ***
## X1           0.14533    0.02593   5.604 2.71e-08 ***
## X2           0.31337    0.02923  10.722  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5451 on 997 degrees of freedom
## Multiple R-squared:  0.1171, Adjusted R-squared:  0.1154 
## F-statistic: 66.14 on 2 and 997 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = y ~ X1 + X2, data = Xy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44171 -0.25468 -0.01736  0.33081  1.45860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.26583    0.01988  13.372  < 2e-16 ***
## X1           0.14533    0.02593   5.604 2.71e-08 ***
## X2           0.31337    0.02923  10.722  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5451 on 997 degrees of freedom
## Multiple R-squared:  0.1171, Adjusted R-squared:  0.1154 
## F-statistic: 66.14 on 2 and 997 DF,  p-value: < 2.2e-16
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Xy, statistic = alpha.fn, R = 2000)
## 
## 
## Bootstrap Statistics :
##      original        bias    std. error
## t1* 0.1453263 -0.0002776672  0.02895703
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = new.Xy, statistic = alpha.fn, R = 2000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.1453263 -0.008855075   0.1934213

6 Chapter 6 Linear Model Selection and Regularization


6.1 Introduction and Best-Subset Selection


6.1.1 Linear Model Selection and Regularization

  • Recall the linear model
    \[ Y = \beta_{0} + \beta_{1}X_{1}+...+\beta_{p}X_{p} + \epsilon \]
  • In the lectures that follow, we consider some approaches extending the linear model framework. In the lectures covering Chapter 7 of the text, we generalize the linear model in order to accommodate non-linear, but still additive, relationships.
  • In the lectures covering Chapter 8 we consider even more general non-linear models.

6.1.2 In praise of linear models!

  • Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance.
  • Hence we discuss in this lecture some ways in which the simple linear model can be improved, by replacing ordinary least squares fitting with some alternative fitting procedures.

6.1.3 Why consider alternatives to least squares?

  • Prediction Accuracy: especially when \(p > n\), to control the variance.
  • Model Interpretability: By removing irrelevant features - that is, by setting the corresponding coefficient estimates to zero - we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection.

6.1.4 Three classes of methods

  • Subset Selection: We identify a subset of the \(p\) predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.
  • Shrinkage. We fit a model involving all \(p\) predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
  • Dimension Reduction. We project the \(p\) predictors into an M-dimensional subspace, where \(M < p\). This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.

6.1.5 Subset Selection

Best subset and stepwise model selection procedures

Best Subset Selection
1. Let \(M_{0}\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.
2. For \(k = 1,2,...,p\):
a. Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors.
b. Pick the best among these \(\binom{p}{k}\) models, and call it \(M_{k}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
3. Select a single best model among \(M_{0},...,M_{p}\) using cross-validated prediction error, \(C_{p}\) (AIC), BIC, or adjusted \(R^2\).
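Step 2 fits \(\binom{p}{k}\) models for each \(k\), so the algorithm fits \(\sum_{k=0}^p \binom{p}{k} = 2^p\) models in total, which is why it becomes infeasible for large \(p\). A quick check:

```python
import math

def n_best_subset_models(p):
    """One model per subset of the p predictors (including the null model)."""
    return sum(math.comb(p, k) for k in range(p + 1))

for p in (10, 20, 40):
    print(p, n_best_subset_models(p))
```

Already at \(p = 40\) this is over a trillion model fits.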

6.2 Stepwise Selection


6.2.1 Extensions to other models

Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression.
The deviance (negative two times the maximized log-likelihood) plays the role of RSS for a broader class of models.

6.2.2 Stepwise Selection

  • For computational reasons, best subset selection cannot be applied with very large \(p\). Why not?
  • Best subset selection may also suffer from statistical problems when p is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data.
  • Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.
  • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection.

6.2.3 Forward Stepwise Selection

  • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model.
  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

6.2.4 In Detail

Forward Stepwise Selection
1. Let \(M_{0}\) denote the null model, which contains no predictors.
2. For \(k = 0,...,p-1\):
2.1 Consider all \(p-k\) models that augment the predictors in \(M_{k}\) with one additional predictor.
2.2 Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having smallest RSS or highest \(R^2\).
3. Select a single best model from among \(M_{0},...,M_{p}\) using cross-validated prediction error, \(C_{P}\) (AIC), BIC, or adjusted \(R^2\).

6.2.5 More on Forward Stepwise Selection

  • Computational advantage over best subset selection is clear.
  • It is not guaranteed to find the best possible model out of all \(2^p\) models containing subsets of the \(p\) predictors. Why not? Give an example.

6.3 Backward Stepwise Selection


  • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection.
  • However, unlike forward stepwise selection, it begins with the full least squares model containing all \(p\) predictors, and then iteratively removes the least useful predictor, one-at-a-time.

6.3.1 Backward Stepwise Selection: details

Backward Stepwise Selection
1. Let \(M_{p}\) denote the full model, which contains all \(p\) predictors.
2. For \(k=p,p-1,...,1\):
2.1 Consider all \(k\) models that contain all but one of the predictors in \(M_{k}\), for a total of \(k-1\) predictors.
2.2 Choose the best among these k models, and call it \(M_{k-1}\). Here best is defined as having smallest RSS or highest \(R^2\).
3. Select a single best model from among \(M_{0},...M_{p}\) using cross-validated prediction error, \(C_{P}\) (AIC), BIC, or adjusted \(R^2\).

6.3.2 More on Backward Stepwise Selection

  • Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p+1)/2\) models, and so can be applied in settings where \(p\) is too large to apply best subset selection.
  • Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the \(p\) predictors.
  • Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n<p\), and so is the only viable subset method when \(p\) is very large.
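Both stepwise procedures fit only \(1 + p(p+1)/2\) models: the null (or full) model, plus \(p-k\) candidates at each step, versus \(2^p\) for best subset selection. A quick check:

```python
def n_forward_stepwise_models(p):
    """The null model, then p - k candidate fits at each step k = 0, ..., p-1."""
    return 1 + sum(p - k for k in range(p))

for p in (10, 20, 40):
    print(p, n_forward_stepwise_models(p), "vs", 2 ** p, "for best subset")
```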

6.3.3 Choosing the Optimal Model

  • The model containing all of the predictors will always have the smallest RSS and the largest \(R^2\), since these quantities are related to the training error.
  • We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error.
  • Therefore, RSS and \(R^2\) are not suitable for selecting the best model among a collection of models with different numbers of predictors.

6.4 Estimating Test Error


6.4.1 Estimating test error: two approaches

  • We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
  • We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.
  • We illustrate both approaches next.

6.4.2 \(C_{p}\), AIC, BIC, and Adjusted \(R^2\)

  • These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
  • The next figure displays \(C_{p}\), BIC, and adjusted \(R^2\) for the best model of each size produced by best subset selection on the Credit data set.

6.4.3 Now for some details

  • Mallows' \(C_{p}\):
    \[ C_{p}=\frac{1}{n}(RSS + 2d\hat\sigma^2) \]
    where \(d\) is the total number of parameters used and \(\hat\sigma^2\) is an estimate of the variance of the error \(\epsilon\) associated with each response measurement.
  • The AIC criterion is defined for a large class of models fit by maximum likelihood:
    \[ AIC=-2\log L+2d \]
    where \(L\) is the maximized value of the likelihood function for the estimated model.
  • In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and \(C_{p}\) and AIC are equivalent.

6.4.4 Details on BIC

\[ BIC = \frac{1}{n}(RSS + \log(n)d\hat\sigma^2) \].
* Like \(C_{p}\), the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
* Notice that BIC replaces the \(2d\hat\sigma^2\) used by \(C_{p}\) with a \(\log(n)d\hat\sigma^2\) term, where \(n\) is the number of observations.
* Since \(\log n > 2\) for any \(n>7\), the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than \(C_{p}\).
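The crossover at \(n=8\) (\(\log 8 \approx 2.08\)) can be seen by tabulating the two penalty terms; a tiny check with \(d=1\), \(\hat\sigma^2=1\):

```python
import math

def cp_penalty(d, sigma2):
    """The 2 d sigma^2 adjustment term in C_p."""
    return 2 * d * sigma2

def bic_penalty(d, sigma2, n):
    """The log(n) d sigma^2 adjustment term in BIC."""
    return math.log(n) * d * sigma2

for n in (5, 7, 8, 100, 10000):
    print(n, cp_penalty(1, 1.0), round(bic_penalty(1, 1.0, n), 3))
```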

6.4.5 Adjusted \(R^2\)

  • For a least squares model with \(d\) variables, the adjusted \(R^2\) statistic is calculated as
    Adjusted \(R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}\).
    Where TSS is the total sum of squares.
  • Unlike \(C_{p}\), AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted \(R^2\) indicates a model with a small test error.
  • Maximizing the adjusted \(R^2\) is equivalent to minimizing \(RSS/(n-d-1)\). While RSS always decreases as the number of variables in the model increases, \(RSS/(n-d-1)\) may increase or decrease, due to the presence of \(d\) in the denominator.
  • Unlike the \(R^2\) statistic, the adjusted \(R^2\) statistic pays a price for the inclusion of unnecessary variables in the model.

6.5 Validation and Cross-Validation


  • Each of the procedures returns a sequence of models \(M_{k}\) indexed by model size \(k=0,1,2,...\) Our job here is to select \(\hat k\). Once selected, we will return model \(M_{\hat k}\).
  • We compute the validation set error or the cross-validation error for each model \(M_{k}\) under consideration, and then select the \(k\) for which the resulting estimated test error is smallest.
  • This procedure has an advantage relative to AIC, BIC, \(C_{p}\), and adjusted \(R^2\), in that it provides a direct estimate of the test error, and doesn't require an estimate of the error variance \(\sigma^2\).
  • It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance \(\sigma^2\).

  • One-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. What is the rationale for this?

6.6 Shrinkage methods and ridge regression


6.6.1 Shrinkage Methods

Ridge Regression and Lasso
* The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
* As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
* It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

6.6.2 Ridge regression

  • Recall that the least squares fitting procedure estimates \(\beta_{0},\beta_{1},...,\beta_{p}\) using the values that minimize
    \[ RSS=\sum_{i=1}^n \Big(y_{i}-\beta_{0}-\sum_{j=1}^p \beta_{j}x_{ij} \Big)^2 \].
  • In contrast, the ridge regression coefficient estimates \(\hat\beta^R\) are the values that minimize
    \[ \sum_{i=1}^n \Big( y_{i} - \beta_{0} - \sum_{j=1}^p \beta_{j}x_{ij}\Big)^2 + \lambda\sum_{j=1}^p \beta_{j}^2 = RSS + \lambda\sum_{j=1}^p \beta_{j}^2 \]
    where \(\lambda \geq 0\) is a tuning parameter, to be determined separately.

6.6.3 Ridge regression: continued

  • As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
  • However, the second term, \(\lambda\sum_{j} \beta_{j}^2\), called a shrinkage penalty, is small when \(\beta_{1},...,\beta_{p}\) are close to zero, and so it has the effect of shrinking the estimates of \(\beta_{j}\) towards zero.
  • The tuning parameter \(\lambda\) serves to control the relative impact of these two terms on the regression coefficient estimates.
  • Selecting a good value for \(\lambda\) is critical; cross-validation is used for this.
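For a single centered predictor, the ridge objective has the closed-form minimizer \(\hat\beta^R = \sum_i (x_{i}-\bar x)(y_{i}-\bar y)\,/\,(\sum_i (x_{i}-\bar x)^2 + \lambda)\) (with intercept \(\bar y\)), which makes the shrinkage toward zero visible as \(\lambda\) grows. A Python sketch on made-up data:

```python
def ridge_1d(x, y, lam):
    """Closed-form ridge coefficient for a single (centered) predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / (sxx + lam)        # lam = 0 recovers the least squares slope

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.9, 4.2, 4.8, 6.1]
for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, ridge_1d(x, y, lam))
```

As \(\lambda\) increases, the coefficient shrinks monotonically toward (but never exactly to) zero.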

6.6.4 Ridge regression: scaling of predictors

  • The standard least squares coefficient estimates are scale equivariant: multiplying \(X_{j}\) by a constant \(c\) simply leads to a scaling of the least squares coefficient estimates by a factor of \(\frac{1}{c}\). In other words, regardless of how the \(j^{th}\) predictor is scaled, \(X_{j}\hat\beta_{j}\) will remain the same.
  • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
  • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
    \[ \tilde x_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij}-\bar x_{j})^2}} \]

6.7 The Lasso


  • Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
  • The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, \(\hat\beta_{\lambda}^L\), minimize the quantity
    \[ \sum_{i=1}^n \Big( y_{i}-\beta_{0}-\sum_{j=1}^p \beta_{j}x_{ij} \Big)^2 + \lambda\sum_{j=1}^p |\beta_{j}| = RSS + \lambda\sum_{j=1}^p |\beta_{j}| \]
  • In statistical parlance, the lasso uses an \(L_{1}\) penalty. The \(L_{1}\) norm of a coefficient vector \(\beta\) is given by \(||\beta||_{1}=\sum|\beta_{j}|\).

6.7.1 The Lasso: continued

  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
  • However, in the case of the lasso, the \(L_{1}\) penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter \(\lambda\) is sufficiently large.
  • Hence, much like best subset selection, the lasso performs variable selection.
  • We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.
  • As in ridge regression, selecting a good value of \(\lambda\) for the lasso is critical; cross-validation is again the method of choice.
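The zeroing behavior is clearest in the orthonormal-design special case, where the lasso coefficient is the soft-thresholded least squares estimate, \(sign(\hat\beta)(|\hat\beta|-t)_{+}\) with threshold \(t\) determined by \(\lambda\), while ridge shrinks proportionally and never reaches zero exactly. A sketch of the two shrinkage operators:

```python
def soft_threshold(b, t):
    """Lasso-style shrinkage: move b toward zero by t, clipping at zero."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

def ridge_shrink(b, t):
    """Ridge-style shrinkage in the same setting: proportional, never exactly zero."""
    return b / (1 + t)

for b in (2.0, 0.3, -0.1, -1.5):
    print(b, soft_threshold(b, 0.5), ridge_shrink(b, 0.5))
```

Small least squares coefficients are set exactly to zero by the lasso operator, which is the variable selection property.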

6.7.2 The Variable Selection Property of the Lasso

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
One can show that the lasso and ridge regression coefficient estimates solve the problems
\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \Big( y_{i}-\beta_{0}-\sum_{j=1}^p \beta_{j}x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_{j}| \leq s \]
and
\[ \underset{\beta}{\text{minimize}} \sum_{i=1}^n \Big( y_{i}-\beta_{0}-\sum_{j=1}^p \beta_{j}x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_{j}^2 \leq s \]
respectively.

6.7.3 Conclusions

  • Neither ridge regression nor the lasso will universally dominate the other.
  • In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
  • However, the number of predictors that is related to the response is never known a priori for real data sets.
  • A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.

6.8 Tuning parameter selection


6.8.1 Selecting the Tuning Parameter for Ridge Regression and Lasso

  • As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best.
  • That is, we require a method for selecting a value of the tuning parameter \(\lambda\), or equivalently, the value of the constraint \(s\).
  • Cross-validation provides a simple way to tackle this problem. We choose a grid of \(\lambda\) values, and compute the cross-validation error rate for each value of \(\lambda\).
  • We then select the tuning parameter value for which the cross-validation error is smallest.
  • Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.

6.9 Dimension Reduction Methods


  • The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrinkage approach, using the original predictors, \(X_{1},X_{2},...,X_{p}\).
  • We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.

6.9.1 Dimension Reduction Methods: details

  • Let \(Z_{1},Z_{2},...,Z_{M}\) represent \(M < p\) linear combinations of our original \(p\) predictors. That is,
    \[ Z_{m}=\sum_{j=1}^p \phi_{mj}X_{j} \]
    for some constants \(\phi_{m1},...,\phi_{mp}\).
  • We can then fit the linear regression model
    \[ y_{i} = \theta_{0} + \sum_{m=1}^M \theta_{m}z_{im} + \epsilon_{i} \], \(i = 1,...,n\),
    using ordinary least squares.
  • Note that in model (2), the regression coefficients are given by \(\theta_{0},\theta_{1},...,\theta_{M}\). If the constants \(\phi_{m1},...,\phi_{mp}\) are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
  • Notice that from definition (1),
    \[ \sum_{m=1}^M \theta_{m}z_{im} = \sum_{m=1}^M \theta_{m} \sum_{j=1}^p \phi_{mj}x_{ij}=\sum_{j=1}^P\sum_{m=1}^M \theta_{m}\phi_{mj}x_{ij} = \sum_{j=1}^p\beta_{j}x_{ij} \],
    where
    \[ \beta_{j}=\sum_{m=1}^M\theta_{m}\phi_{mj} \].
  • Hence model (2) can be thought of as a special case of the original linear regression model.
  • Dimension reduction serves to constrain the estimated \(\beta_{j}\) coefficients, since now they must take the form (3).
  • Can win in the bias-variance tradeoff.
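The identity (3), \(\beta_{j}=\sum_{m}\theta_{m}\phi_{mj}\), can be checked numerically: predicting through the \(M\) features \(Z_{m}\) gives the same answer as predicting with the implied \(\beta\) in the original predictors. A Python sketch with arbitrary made-up \(\phi\) and \(\theta\):

```python
# Hypothetical constants: M = 2 linear combinations of p = 3 predictors.
phi = [[0.5, 0.5, 0.0],   # phi_1j
       [0.0, 0.3, 0.7]]   # phi_2j
theta0, theta = 1.0, [2.0, -1.0]

def predict_via_z(x):
    """theta_0 + sum_m theta_m z_m, with z_m = sum_j phi_mj x_j."""
    z = [sum(p_j * x_j for p_j, x_j in zip(row, x)) for row in phi]
    return theta0 + sum(t * zm for t, zm in zip(theta, z))

def predict_via_beta(x):
    """Same model in the original predictors: beta_j = sum_m theta_m phi_mj."""
    beta = [sum(theta[m] * phi[m][j] for m in range(len(theta)))
            for j in range(len(x))]
    return theta0 + sum(b * xj for b, xj in zip(beta, x))

x = [1.0, 2.0, 3.0]
print(predict_via_z(x), predict_via_beta(x))
```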

6.10 Principal Components Regression and Partial Least Squares


6.10.1 Principal Components Regression

  • Here we apply principal components analysis (PCA) to define the linear combinations of the predictors, for use in our regression.
  • The first principal component is the (normalized) linear combination of the variables with the largest variance.
  • The second principal component has the largest variance, subject to being uncorrelated with the first.
  • And so on.
  • Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.

6.10.2 Partial Least Squares

  • PCR identifies linear combinations, or directions, that best represent the predictors \(X_{1},...,X_{p}\).
  • These directions are identified in an unsupervised way, since the response \(Y\) is not used to help determine the principal component directions.
  • That is, the response does not supervise the identification of the principal components.
  • Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
  • Like PCR, PLS is a dimension reduction method, which first identifies a new set of features \(Z_{1},...,Z_{M}\) that are linear combinations of the original features, and then fits a linear model via OLS using these M new features.
  • But unlike PCR, PLS identifies these new features in a supervised way - that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also are related to the response.
  • Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

6.10.3 Details of Partial Least Squares

  • After standardizing the \(p\) predictors, PLS computes the first direction \(Z_{1}\) by setting each \(\phi_{1j}\) in (1) equal to the coefficient from the simple linear regression of \(Y\) onto \(X_{j}\).
  • One can show that this coefficient is proportional to the correlation between \(Y\) and \(X_{j}\).
  • Hence, in computing \(Z_{1}=\sum_{j=1}^p \phi_{1j}X_{j}\), PLS places the highest weight on the variables that are most strongly related to the response.
  • Subsequent directions are found by taking residuals and then repeating the above prescription.

6.10.4 Summary

  • Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors.
  • Research into methods that give sparsity, such as the lasso, is an especially hot area.
  • Later, we will return to sparsity in more detail, and will describe related approaches such as the elastic net.

7 Model Selection in R

library(ISLR)
summary(Hitters)
##      AtBat            Hits           HmRun            Runs       
##  Min.   : 19.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:282.5   1st Qu.: 71.5   1st Qu.: 5.00   1st Qu.: 33.50  
##  Median :413.0   Median :103.0   Median : 9.00   Median : 52.00  
##  Mean   :403.6   Mean   :107.8   Mean   :11.62   Mean   : 54.75  
##  3rd Qu.:526.0   3rd Qu.:141.5   3rd Qu.:18.00   3rd Qu.: 73.00  
##  Max.   :687.0   Max.   :238.0   Max.   :40.00   Max.   :130.00  
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 30.00   1st Qu.: 23.00   1st Qu.: 4.000   1st Qu.:  842.5  
##  Median : 47.00   Median : 37.00   Median : 6.000   Median : 1931.0  
##  Mean   : 51.49   Mean   : 41.11   Mean   : 7.312   Mean   : 2657.5  
##  3rd Qu.: 71.00   3rd Qu.: 57.00   3rd Qu.:10.000   3rd Qu.: 3890.5  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##      CHits            CHmRun           CRuns             CRBI       
##  Min.   :   4.0   Min.   :  0.00   Min.   :   2.0   Min.   :   3.0  
##  1st Qu.: 212.0   1st Qu.: 15.00   1st Qu.: 105.5   1st Qu.:  95.0  
##  Median : 516.0   Median : 40.00   Median : 250.0   Median : 230.0  
##  Mean   : 722.2   Mean   : 69.24   Mean   : 361.2   Mean   : 330.4  
##  3rd Qu.:1054.0   3rd Qu.: 92.50   3rd Qu.: 497.5   3rd Qu.: 424.5  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.0  
##      CWalks       League  Division    PutOuts          Assists     
##  Min.   :   1.0   A:139   E:129    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  71.0   N:124   W:134    1st Qu.: 113.5   1st Qu.:  8.0  
##  Median : 174.0                    Median : 224.0   Median : 45.0  
##  Mean   : 260.3                    Mean   : 290.7   Mean   :118.8  
##  3rd Qu.: 328.5                    3rd Qu.: 322.5   3rd Qu.:192.0  
##  Max.   :1566.0                    Max.   :1377.0   Max.   :492.0  
##      Errors           Salary       NewLeague
##  Min.   : 0.000   Min.   :  67.5   A:141    
##  1st Qu.: 3.000   1st Qu.: 190.0   N:122    
##  Median : 7.000   Median : 425.0            
##  Mean   : 8.593   Mean   : 535.9            
##  3rd Qu.:13.000   3rd Qu.: 750.0            
##  Max.   :32.000   Max.   :2460.0

There are some missing values here, so before we proceed we will remove them:

Hitters=na.omit(Hitters)
with(Hitters, sum(is.na(Salary)))
## [1] 0

7.1 Best Subset Regression

We will now use the package 'leaps' to evaluate all the best-subset models.

#install.packages("leaps")
library(leaps)
regfit.full = regsubsets(Salary~.,data=Hitters)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "

By default it gives best subsets up to size 8; let's increase that to 19, i.e. all the variables.

regfit.full=regsubsets(Salary~.,data=Hitters, nvmax=19)
reg.summary =  summary(regfit.full)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
plot(reg.summary$cp, xlab="number of Variables", ylab="Cp")
which.min(reg.summary$cp)
## [1] 10
points(10,reg.summary$cp[10],pch=20,col="red")

There is a plot method for the 'regsubsets' object:

plot(regfit.full,scale="Cp")

coef(regfit.full,10)
##  (Intercept)        AtBat         Hits        Walks       CAtBat 
##  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798 
##        CRuns         CRBI       CWalks    DivisionW      PutOuts 
##    1.4082490    0.7743122   -0.8308264 -112.3800575    0.2973726 
##      Assists 
##    0.2831680

7.2 Forward Stepwise Selection

Here we use the 'regsubsets' function but specify the 'method = "forward"' option:

regfit.fwd = regsubsets(Salary~., data = Hitters, nvmax = 19, method = "forward")  
summary(regfit.fwd)   
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"
plot(regfit.fwd, scale = "Cp")   

7.3 Model Selection Using a Validation Set

Let's make a training and validation set so that we can choose a good subset model.
We will do it using a slightly different approach from the one in the book.

dim(Hitters)
## [1] 263  20
set.seed(1)
train = sample(seq(263),180,replace=FALSE)
train
##   [1]  70  98 150 237  53 232 243 170 161  16 259  45 173  97 192 124 178
##  [18] 245  94 190 228  52 158  31  64  92   4  91 205  80 113 140 115  43
##  [35] 244 153 181  25 163  93 184 144 174 122 117 251   6 104 241 149 102
##  [52] 183 224 242  15  21  66 107 136  83 186  60 211  67 130 210  95 151
##  [69]  17 256 207 162 200 239 236 168 249  73 222 177 234 199 203  59 235
##  [86]  37 126  22 230 226  42  11 110 214 132 134  77  69 188 100 206  58
## [103]  44 159 101  34 208  75 185 201 261 112  54  65  23   2 106 254 257
## [120] 154 142  71 166 221 105  63 143  29 240 212 167 172   5  84 120 133
## [137]  72 191 248 138 182  74 179 135  87 196 157 119  13  99 263 125 247
## [154]  50  55  20  57   8  30 194 139 238  46  78  88  41   7  33 141  32
## [171] 180 164 213  36 215  79 225 229 198  76
regfit.fwd = regsubsets(Salary~.,data=Hitters[train,], nvmax=19, method = "forward")

Now we will make predictions on the observations not used for training. We know there are 19 models, so we set up some vectors to record the errors. We have to do a bit of work here, because there is no predict method for 'regsubsets'.

val.errors=rep(NA,19)
x.test=model.matrix(Salary~., data=Hitters[-train,]) # notice the index!    
for(i in 1:19){
    coefi = coef(regfit.fwd, id = i)
    pred = x.test[,names(coefi)]%*%coefi
    val.errors[i]=mean((Hitters$Salary[-train]-pred)^2)
}
plot(sqrt(val.errors), ylab="Root MSE", ylim=c(300,400), pch=19, type="b")
points(sqrt(regfit.fwd$rss[-1]/180), col="blue", pch=19,type="b")
legend("topright", legend=c("Training", "Validation"), col=c("blue","black"),pch=19)

As we expect, the training error goes down monotonically as the model gets bigger, but not so for the validation error.

This was a little tedious, since there is no predict method for 'regsubsets'.
So we will write one!

predict.regsubsets=function(object, newdata,id,...){
    form=as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi=coef(object,id=id)
    mat[,names(coefi)]%*%coefi
}
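With this method in place, the validation errors computed earlier collapse to a few lines. A self-contained sketch, re-creating the training split from above (assumes the ISLR and leaps packages are installed):

```r
library(ISLR)
library(leaps)
Hitters <- na.omit(Hitters)
set.seed(1)
train <- sample(seq(263), 180, replace = FALSE)
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters[train, ],
                         nvmax = 19, method = "forward")
# our predict method for regsubsets objects
predict.regsubsets <- function(object, newdata, id, ...) {
    form  <- as.formula(object$call[[2]])
    mat   <- model.matrix(form, newdata)
    coefi <- coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}
# validation MSE for each of the 19 model sizes, in one call
val.errors <- sapply(1:19, function(i)
    mean((Hitters$Salary[-train] -
          predict(regfit.fwd, Hitters[-train, ], id = i))^2))
```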

7.4 Model Selection by Cross-Validation

We will do 10-fold cross-validation. It's really easy!

set.seed(11)
folds = sample(rep(1:10, length=nrow(Hitters)))
folds
##   [1]  3  1  4  4  7  7  3  5  5  2  5  2  8  3  3  3  9  2  9  8 10  5  8
##  [24]  5  5  5  5 10 10  4  4  7  6  7  7  7  3  4  8  3  6  8 10  4  3  9
##  [47]  9  3  4  9  8  7 10  6 10  3  6  9  4  2  8  2  5  6 10  7  2  8  8
##  [70]  1  3  6  2  5  8  1  1  2  8  1 10  1  2  3  6  6  5  8  8 10  4  2
##  [93]  6  1  7  4  8  3  7  8  7  1 10  1  6  2  9 10  1  7  7  4  7  4 10
## [116]  3  6 10  6  6  9  8 10  6  7  9  6  7  1 10  2  2  5  9  9  6  1  1
## [139]  2  9  4 10  5  3  7  7 10 10  9  3  3  7  3  1  4  6  6 10  4  9  9
## [162]  1  3  6  8 10  8  5  4  5  6  2  9 10  3  7  7  6  6  2  3  2  4  4
## [185]  4  4  8  2  3  5  9  9 10  2  1  3  9  6  7  3  1  9  4 10 10  8  8
## [208]  8  2  5  9  8 10  5  8  2  4  1  4  4  5  5  2  1  9  5  2  9  9  5
## [231]  3  2  1  9  1  7  2  5  8  1  1  7  6  6  4  5 10  5  7  4  8  6  9
## [254]  1  2  5  7  1  3  1  3  1  2
table(folds)
## folds
##  1  2  3  4  5  6  7  8  9 10 
## 27 27 27 26 26 26 26 26 26 26
cv.errors = matrix(NA,10,19) # make matrix for our errors
for(k in 1:10){
    best.fit=regsubsets(Salary~., data=Hitters[folds!=k,], nvmax=19, method="forward")
    for(i in 1:19){
        pred = predict(best.fit, Hitters[folds==k,], id=i)
        cv.errors[k,i] = mean( (Hitters$Salary[folds==k] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors,2,mean))
plot(rmse.cv,pch=19,type="b")

7.5 Ridge Regression and the Lasso

We will use the package glmnet, which does not use the model formula language, so we will set up an x and a y.

#install.packages("glmnet", repos = "http://cran.us.r-project.org")
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.1.3
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.1.3
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:base':
## 
##     crossprod, tcrossprod
## 
## Loading required package: foreach
## Loaded glmnet 2.0-2
x = model.matrix(Salary~., data = Hitters)
y = Hitters$Salary 

First we will fit a ridge-regression model. This is achieved by calling glmnet with alpha=0 (see the help file). There is also a cv.glmnet function which will do the cross-validation for us.

fit.ridge = glmnet(x,y,alpha=0) # alpha = 0: ridge; alpha = 1: lasso; in between: elastic net
plot(fit.ridge,xvar="lambda", label=TRUE)

cv.ridge=cv.glmnet(x,y,alpha=0)
plot(cv.ridge)
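cv.glmnet stores the lambda that minimized the cross-validation error (as `lambda.min`, along with the more conservative `lambda.1se`). A minimal sketch of pulling out the chosen coefficients, re-creating the objects so it is self-contained (assumes glmnet and ISLR are installed):

```r
library(glmnet)
library(ISLR)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., data = Hitters)
y <- Hitters$Salary
cv.ridge <- cv.glmnet(x, y, alpha = 0)
cv.ridge$lambda.min              # lambda minimizing the CV error
coef(cv.ridge, s = "lambda.min") # coefficients at that lambda
```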

Now we fit a lasso model; for this we use the default alpha=1.

fit.lasso=glmnet(x,y)
plot(fit.lasso, xvar="lambda", label=TRUE)

cv.lasso=cv.glmnet(x,y)
plot(cv.lasso)

coef(cv.lasso)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 127.95694754
## (Intercept)   .         
## AtBat         .         
## Hits          1.42342566
## HmRun         .         
## Runs          .         
## RBI           .         
## Walks         1.58214111
## Years         .         
## CAtBat        .         
## CHits         .         
## CHmRun        .         
## CRuns         0.16027975
## CRBI          0.33667715
## CWalks        .         
## LeagueN       .         
## DivisionW    -8.06171262
## PutOuts       0.08393604
## Assists       .         
## Errors        .         
## NewLeagueN    .

Suppose we want to use our earlier train/validation division to select lambda for the lasso.
This is easy to do:

lasso.tr=glmnet(x[train,],y[train])
lasso.tr
## 
## Call:  glmnet(x = x[train, ], y = y[train]) 
## 
##       Df    %Dev    Lambda
##  [1,]  0 0.00000 246.40000
##  [2,]  1 0.05013 224.50000
##  [3,]  1 0.09175 204.60000
##  [4,]  2 0.13840 186.40000
##  [5,]  2 0.18000 169.80000
##  [6,]  3 0.21570 154.80000
##  [7,]  3 0.24710 141.00000
##  [8,]  3 0.27320 128.50000
##  [9,]  4 0.30010 117.10000
## [10,]  4 0.32360 106.70000
## [11,]  4 0.34310  97.19000
## [12,]  4 0.35920  88.56000
## [13,]  5 0.37360  80.69000
## [14,]  5 0.38900  73.52000
## [15,]  5 0.40190  66.99000
## [16,]  5 0.41260  61.04000
## [17,]  5 0.42140  55.62000
## [18,]  5 0.42880  50.67000
## [19,]  5 0.43490  46.17000
## [20,]  5 0.43990  42.07000
## [21,]  5 0.44410  38.33000
## [22,]  5 0.44760  34.93000
## [23,]  6 0.45140  31.83000
## [24,]  7 0.45480  29.00000
## [25,]  7 0.45770  26.42000
## [26,]  7 0.46010  24.07000
## [27,]  8 0.46220  21.94000
## [28,]  8 0.46380  19.99000
## [29,]  8 0.46520  18.21000
## [30,]  8 0.46630  16.59000
## [31,]  8 0.46730  15.12000
## [32,]  8 0.46810  13.78000
## [33,]  9 0.47110  12.55000
## [34,]  9 0.47380  11.44000
## [35,]  9 0.47620  10.42000
## [36,] 10 0.48050   9.49500
## [37,]  9 0.48450   8.65200
## [38,] 10 0.48770   7.88300
## [39,] 10 0.49360   7.18300
## [40,] 11 0.49890   6.54500
## [41,] 12 0.50450   5.96300
## [42,] 12 0.51010   5.43400
## [43,] 13 0.51470   4.95100
## [44,] 13 0.51850   4.51100
## [45,] 13 0.52170   4.11000
## [46,] 14 0.52440   3.74500
## [47,] 14 0.52670   3.41200
## [48,] 14 0.52870   3.10900
## [49,] 14 0.53030   2.83300
## [50,] 14 0.53160   2.58100
## [51,] 15 0.53280   2.35200
## [52,] 16 0.53420   2.14300
## [53,] 17 0.53580   1.95300
## [54,] 17 0.53760   1.77900
## [55,] 17 0.53890   1.62100
## [56,] 17 0.54000   1.47700
## [57,] 17 0.54090   1.34600
## [58,] 17 0.54160   1.22600
## [59,] 17 0.54220   1.11700
## [60,] 17 0.54280   1.01800
## [61,] 17 0.54320   0.92770
## [62,] 17 0.54360   0.84530
## [63,] 17 0.54380   0.77020
## [64,] 18 0.54410   0.70180
## [65,] 18 0.54430   0.63940
## [66,] 18 0.54450   0.58260
## [67,] 18 0.54470   0.53090
## [68,] 18 0.54490   0.48370
## [69,] 19 0.54510   0.44070
## [70,] 19 0.54520   0.40160
## [71,] 19 0.54530   0.36590
## [72,] 19 0.54540   0.33340
## [73,] 19 0.54550   0.30380
## [74,] 19 0.54560   0.27680
## [75,] 19 0.54570   0.25220
## [76,] 19 0.54570   0.22980
## [77,] 19 0.54580   0.20940
## [78,] 19 0.54580   0.19080
## [79,] 19 0.54590   0.17380
## [80,] 19 0.54590   0.15840
## [81,] 19 0.54590   0.14430
## [82,] 19 0.54590   0.13150
## [83,] 19 0.54600   0.11980
## [84,] 18 0.54600   0.10920
## [85,] 18 0.54600   0.09948
## [86,] 18 0.54600   0.09064
## [87,] 18 0.54600   0.08259
## [88,] 19 0.54600   0.07525
## [89,] 19 0.54600   0.06856
pred=predict(lasso.tr,x[-train,])
dim(pred)
## [1] 83 89
rmse = sqrt(apply((y[-train]-pred)^2,2,mean)) # y[-train] is recycled across the columns of pred
plot(log(lasso.tr$lambda), rmse, type="b", xlab="Log(lambda)")

lam.best = lasso.tr$lambda[order(rmse)[1]]
lam.best
## [1] 19.98706
coef(lasso.tr, s = lam.best)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)   77.8923666
## (Intercept)    .        
## AtBat          .        
## Hits           0.1591252
## HmRun          .        
## Runs           .        
## RBI            1.7340039
## Walks          3.4657091
## Years          .        
## CAtBat         .        
## CHits          .        
## CHmRun         .        
## CRuns          0.5386855
## CRBI           .        
## CWalks         .        
## LeagueN       30.0493021
## DivisionW   -113.8317016
## PutOuts        0.2915409
## Assists        .        
## Errors         .        
## NewLeagueN     2.0367518

8 Chapter 7 Moving Beyond Linearity


8.1 Polynomial and Step Functions


Moving Beyond Linearity
The truth is never linear!
Or almost never!

But often the linearity assumption is good enough.

When it's not...
* polynomials,
* step functions,
* splines,
* local regression, and
* generalized additive models

offer a lot of flexibility, without losing the ease and interpretability of linear models.

8.1.1 Polynomial Regression

\(y_{i}=\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^2+\beta_{3}x_{i}^3+...+\beta_{d}x_{i}^d+\epsilon_{i}\)

8.1.2 Details

  • Create new variables \(X_{1} = X,X_{2}=X^2\), etc. and then treat as multiple linear regression.
  • Not really interested in the coefficients; more interested in the fitted function values at any value \(x_{0}\):
    \[ \hat f(x_{0})=\hat\beta_{0}+\hat\beta_{1}x_{0}+\hat\beta_{2}x_{0}^2+\hat\beta_{3}x_{0}^3+\hat\beta_{4}x_{0}^4 \]

  • Since \(\hat f(x_{0})\) is a linear function of \(\hat\beta_{\ell}\), can get a simple expression for pointwise-variance \(Var[\hat f(x_{0})]\) at any value \(x_{0}\). In the figure we have computed the fit and pointwise standard errors on a grid of values for \(x_{0}\). We show \(\hat f(x_{0}) \pm 2 \cdot se[\hat f(x_{0})]\).
  • We either fix the degree \(d\) at some reasonably low value, or else use cross-validation to choose \(d\).
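The fit-plus-bands computation described above can be done with a few lines of R, shown here on the Wage data from the ISLR package (the dataset these notes' figures are based on):

```r
# Degree-4 polynomial fit with pointwise standard-error bands
library(ISLR)
fit <- lm(wage ~ poly(age, 4), data = Wage)
age.grid <- seq(min(Wage$age), max(Wage$age))
preds <- predict(fit, newdata = data.frame(age = age.grid), se = TRUE)
# fitted curve plus/minus 2 standard errors
se.bands <- cbind(preds$fit + 2 * preds$se.fit,
                  preds$fit - 2 * preds$se.fit)
plot(Wage$age, Wage$wage, col = "darkgrey")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, col = "blue", lty = 2)
```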

8.1.3 Details continued

  • Logistic regression follows naturally. For example, in the figure we model
    \[ Pr(y_{i} > 250\mid x_{i}) = \frac{exp(\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^2+...+\beta_{d}x_{i}^d)}{1 + exp(\beta_{0}+\beta_{1}x_{i}+\beta_{2}x_{i}^2+...+\beta_{d}x_{i}^d)} \]
  • To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert to get on probability scale.
  • Can do separately on several variables--just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).
  • Caveat: polynomials have notorious tail behavior - very bad for extrapolation.
  • Can fit in R using y ~ poly(x, degree = 3) in the formula.
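The logit-scale interval construction described above looks like this in R, again on the ISLR Wage data:

```r
# Polynomial logistic regression for Pr(wage > 250 | age)
library(ISLR)
fit <- glm(I(wage > 250) ~ poly(age, 3), data = Wage, family = binomial)
preds <- predict(fit, newdata = data.frame(age = 20:80), se = TRUE)
# predict() returns fits on the logit scale; invert to probabilities
pfit <- exp(preds$fit) / (1 + exp(preds$fit))
# confidence band: compute bounds on the logit scale, then invert
upper <- preds$fit + 2 * preds$se.fit
lower <- preds$fit - 2 * preds$se.fit
band  <- exp(cbind(lower, upper)) / (1 + exp(cbind(lower, upper)))
```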

8.1.4 Step Functions

Another way of creating transformations of a variable: cut the variable into distinct regions
\[ C_{1}(X)=I(X<35), C_{2}(X)=I(35\leq X < 50),...,C_{K}(X)=I(X \geq 65) \]
* Easy to work with. Creates a series of dummy variables representing each group.
* Useful way of creating interactions that are easy to interpret. For example, interaction effect of Year and Age:
\[ I(\text{Year}<2005) \cdot \text{Age}, I(\text{Year} \geq 2005) \cdot \text{Age} \]
would allow for different linear functions in each age category.
* In R: I(year < 2005) or cut(age, c(18,25,40,65,90)).
* Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available.
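A step-function fit via cut() can be sketched on synthetic data so the example is self-contained; with the Wage data you would use cut(age, ...) in exactly the same way:

```r
# cut() creates a factor; lm() then makes one dummy variable per region
set.seed(2)
age  <- sample(18:80, 200, replace = TRUE)
wage <- 50 + 0.8 * age + rnorm(200, sd = 10)
bins <- cut(age, c(18, 25, 40, 65, 90), include.lowest = TRUE)
fit  <- lm(wage ~ bins)   # intercept + 3 dummies = 4 coefficients
```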

8.2 Piecewise-Polynomials and Splines


8.2.1 Piecewise Polynomials

  • Instead of a single polynomial in \(X\) over its whole domain, we can rather use different polynomials in regions defined by knots. E.g. (see figure)
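A minimal sketch of the idea on simulated data: two separate cubics fitted on either side of a single knot. This discontinuous fit is the crude starting point; splines, covered next, add continuity constraints at the knots.

```r
# Piecewise cubic: independent fits to the left and right of x = 5
set.seed(3)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
left  <- lm(y ~ poly(x, 3), subset = x <  5)
right <- lm(y ~ poly(x, 3), subset = x >= 5)
```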

9 Examples

Install and load the necessary packages and datasets:

library("MASS")
library("ISLR")
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 3.1.3
## 
## Attaching package: 'ggplot2'
## 
## The following object is masked from 'Auto':
## 
##     mpg
library("car")
## Warning: package 'car' was built under R version 3.1.3
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:boot':
## 
##     logit
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.1.2
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("tidyr")
## Warning: package 'tidyr' was built under R version 3.1.2
## 
## Attaching package: 'tidyr'
## 
## The following object is masked from 'package:Matrix':
## 
##     expand

9.1 3.6.2 Simple Linear Regression

The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).

#View(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
#?Boston

We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic lm() syntax is lm(y∼x,data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.

lm.fit = lm(medv~lstat, data = Boston)
lm.fit
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

We can use the names() function in order to find out what other pieces of information are stored in lm.fit. Although we can extract these quantities by name (e.g. lm.fit$coefficients), it is safer to use extractor functions like coef() to access them.

names(lm.fit)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
coef(lm.fit)
## (Intercept)       lstat 
##  34.5538409  -0.9500494

In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.

confint(lm.fit)
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.

predict(lm.fit, data.frame(lstat=(c(5,10,15))), interval = "confidence")
##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
predict(lm.fit, data.frame(lstat=(c(5,10,15))), interval = "prediction")
##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846

We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
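The code chunk for that plot (assumes the MASS package; lm.fit re-created so the chunk is self-contained):

```r
# Scatterplot of medv vs lstat with the fitted least-squares line
library(MASS)
lm.fit <- lm(medv ~ lstat, data = Boston)
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv")
abline(lm.fit, lwd = 2, col = "red")
```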

Next we examine some diagnostic plots, several of which were discussed in Section 3.3.3. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow=c(2,2)) divides the plotting region into a 2 × 2 grid of panels.

par(mfrow=c(2,2))
plot(lm.fit)

Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.

On the basis of the residual plots, there is some evidence of non-linearity. Leverage statistics can be computed for any number of predictors using the hatvalues() function.
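The residual and leverage plots described above can be produced by hand as follows (lm.fit re-created so the chunk stands alone):

```r
library(MASS)
lm.fit <- lm(medv ~ lstat, data = Boston)
plot(predict(lm.fit), residuals(lm.fit))  # raw residuals vs fitted values
plot(predict(lm.fit), rstudent(lm.fit))   # studentized residuals vs fitted
plot(hatvalues(lm.fit))                   # leverage for each observation
```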

which.max(hatvalues(lm.fit))
## 375 
## 375

9.2 3.6.3 Multiple Linear Regression

In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y∼x1+x2+x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.

lm.fit=lm(medv ~ lstat+age,data=Boston) 
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:

lm.fit=lm(medv ~., data=Boston)
summary(lm.fit)
## 
## Call:
## lm(formula = medv ~ ., data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIFs are low to moderate for this data. The car package is not part of the base R installation, so it must be installed the first time you use it via install.packages().

?summary.lm
summary(lm.fit)$r.sq
## [1] 0.7406427
summary(lm.fit)$sigma
## [1] 4.745298
vif(lm.fit)
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.

lm.fit1=lm(medv~.-age,data=Boston) 
summary(lm.fit1)
## 
## Call:
## lm(formula = medv ~ . - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6054  -2.7313  -0.5188   1.7601  26.2243 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
## crim         -0.108006   0.032832  -3.290 0.001075 ** 
## zn            0.046334   0.013613   3.404 0.000719 ***
## indus         0.020562   0.061433   0.335 0.737989    
## chas          2.689026   0.859598   3.128 0.001863 ** 
## nox         -17.713540   3.679308  -4.814 1.97e-06 ***
## rm            3.814394   0.408480   9.338  < 2e-16 ***
## dis          -1.478612   0.190611  -7.757 5.03e-14 ***
## rad           0.305786   0.066089   4.627 4.75e-06 ***
## tax          -0.012329   0.003755  -3.283 0.001099 ** 
## ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
## black         0.009321   0.002678   3.481 0.000544 ***
## lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.74 on 493 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7343 
## F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16

Alternatively, the update() function can be used.

#lm.fit1 = update(lm.fit, ~ . - age)

9.3 3.6.4 Interaction Terms

It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat×age as predictors; it is a shorthand for lstat+age+lstat:age.

summary(lm(medv ~ lstat*age, data = Boston))
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

9.4 3.6.5 Non-linear Transformations of the Predictors

The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor X, we can create a predictor X^2 using I(X^2). The function I() is needed since ^ has a special meaning in a formula; wrapping the term in I() allows the standard R usage, namely raising X to the power 2. We now perform a regression of medv onto lstat and lstat^2.

lm.fit2 = lm(medv ~ lstat + I(lstat^2), data = Boston)
summary(lm.fit2)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.

lm.fit <- lm(medv ~ lstat, data = Boston)
anova(lm.fit, lm.fit2)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    504 19472                                 
## 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here Model 1 represents the linear submodel containing only one predictor, lstat, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat and lstat^2. The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat. If we type the following, we see that when the I(lstat^2) term is included in the model, there is little discernible pattern in the residuals:

par(mfrow = c(2,2))
plot(lm.fit2)

In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:

lm.fit5 = lm(medv ~ poly(lstat, 5), data = Boston)
summary(lm.fit5)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
## poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
## poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
## poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
## poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
## poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
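One detail worth noting: poly() returns orthogonal polynomials by default, which is why the coefficients above differ from those of a fit built with I(lstat^2), I(lstat^3), and so on. The two parameterizations span the same model space, so the fitted values agree; passing raw = TRUE requests the raw powers instead. A quick check:

```r
library(MASS)  # Boston

# Orthogonal (default) versus raw fifth-degree polynomial bases:
fit.orth <- lm(medv ~ poly(lstat, 5), data = Boston)
fit.raw  <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = Boston)

# Coefficients differ, but the fitted values are identical:
all.equal(unname(fitted(fit.orth)), unname(fitted(fit.raw)))
## [1] TRUE
```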

Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.

summary(lm(medv ~ log(rm), data = Boston))
## 
## Call:
## lm(formula = medv ~ log(rm), data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.487  -2.875  -0.104   2.837  39.816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -76.488      5.028  -15.21   <2e-16 ***
## log(rm)       54.055      2.739   19.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.915 on 504 degrees of freedom
## Multiple R-squared:  0.4358, Adjusted R-squared:  0.4347 
## F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16

9.5 3.6.6 Qualitative Predictors

We will now examine the Carseats data, which is part of the ISLR library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.

#fix(Carseats)
#View(Carseats)
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location (that is, the space within a store in which the car seat is displayed) at each location. The predictor ShelveLoc takes on three possible values, Bad, Medium, and Good.
Given a qualitative variable such as ShelveLoc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

lm.fit = lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

The contrasts() function returns the coding that R uses for the dummy variables.

attach(Carseats)
## The following objects are masked from Carseats (position 15):
## 
##     Advertising, Age, CompPrice, Education, Income, Population,
##     Price, Sales, ShelveLoc, Urban, US
contrasts(ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
?contrasts

R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location leads to higher sales than a bad shelving location but lower sales than a good shelving location.
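Because Bad is the first factor level, it serves as the baseline. If a different reference category is preferred, relevel() changes it; below is a small sketch that makes Medium the baseline (ShelveLoc2 is a hypothetical name introduced here, not part of the data set):

```r
library(ISLR)  # Carseats

# Recode so Medium is the reference level; the dummies then measure
# Bad and Good relative to a Medium shelving location:
Carseats$ShelveLoc2 <- relevel(Carseats$ShelveLoc, ref = "Medium")
contrasts(Carseats$ShelveLoc2)
```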
9.6 3.6.7 Writing Functions

As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function, called LoadLibraries(), that loads the ISLR and MASS libraries. Before we have created the function, R returns an error if we try to call it.

LoadLibraries = function(){
    library(ISLR)
    library(MASS)
    print ("The libraries have been loaded.")
}

Now if we type in LoadLibraries, R will tell us what is in the function.

LoadLibraries
## function(){
##     library(ISLR)
##     library(MASS)
##     print ("The libraries have been loaded.")
## }

If we call the function, the libraries are loaded in and the print statement is output.

LoadLibraries()
## [1] "The libraries have been loaded."

3.7 Exercises
CONCEPTUAL
1. Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model. Answer: insert_here
2. Carefully explain the differences between the KNN classifier and KNN regression methods.
Answer:insert_here
3. Suppose we have a data set with five predictors, \(X_1=GPA\), \(X_2=IQ\), \(X_3=Gender\), \(X_4=Interaction\) between GPA and IQ, and \(X_5=Interaction\) between GPA and Gender.